We had problems with bank transfers on 30th May. Here's what happened and how we're fixing it for the future

On the 30th of May, the company we use to connect to Faster Payments had a technical issue, which meant a quarter of bank transfers into Monzo accounts were failing or delayed by several hours, and bank transfers from Monzo accounts were delayed by a few minutes.

There was an issue with a system that this company uses to translate payment messages from one internal format to another (they call this a “transformer”). This issue corrupted payment messages, which meant we couldn’t process bank transfers properly.
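
To make the “transformer” idea a bit more concrete, here is a minimal sketch, in Go, of a component that translates a payment message from one internal format to another. It is purely illustrative: the struct names and fields are invented, and this is not our partner’s actual code.

```go
package main

import "fmt"

// These two structs stand in for the "one internal format" and the "other"
// mentioned above. The real message formats aren't public, so every field
// here is invented for illustration.
type inboundPayment struct {
	Ref         string
	AmountPence int64
	SortCode    string
	Account     string
}

type internalPayment struct {
	Reference   string
	AmountPence int64
	Destination string // sort code and account number combined
}

// transform plays the role of the partner's "transformer": it translates a
// message from the inbound format into the internal one.
func transform(in inboundPayment) internalPayment {
	return internalPayment{
		Reference:   in.Ref,
		AmountPence: in.AmountPence,
		Destination: in.SortCode + in.Account,
	}
}

func main() {
	in := inboundPayment{Ref: "ABC123", AmountPence: 1000, SortCode: "040004", Account: "12345678"}
	fmt.Printf("%+v\n", transform(in))
}
```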

One of our payments engineers, @nickrw, has given a high-level overview and broken down in technical detail what went wrong, as well as sharing our plans to make sure it doesn't happen again.

Again, we’re really sorry about what went wrong here. And rest assured we’re committed to fixing it.

39 Likes

Appreciate the transparency with these posts. Also, as a software engineer in a very different field, they always make a fascinating read.

13 Likes

This is why I bank with you guys. Open and completely honest. Thanks for the breakdown.

11 Likes

Excellent write up, enjoyed reading it.

Please do more articles on how things work; it doesn't have to just be when something goes wrong.

10 Likes

Excellent write up. As ever, I'm pleased to be a customer of, and investor in, a bank that explains these issues clearly and in full detail.

3 Likes

Really appreciate reading this. I wasn't affected, but that's a fantastic insight into your ITSM and particularly your incident and problem management processes. I know some people were unhappy about the timescales for fixing this, but complex incidents can take time to resolve, particularly when you're working with third-party service providers. This level of detail and transparency from a bank is great.

2 Likes

Interesting. Why did only one transformer fail? Wouldn’t there be a load balancer in front of them?

2 Likes

Great article describing the problem in detail and how you are going to try to ensure it doesn't happen again. This is the sort of response you want from a bank when they have an outage!

2 Likes

As a complete layperson, and someone who wasn’t affected by the problem, this was an interesting read. Thank you.

4 Likes

My understanding of our partner's infrastructure is incomplete, but I do know the transformer code has an extremely rare race condition (this is the first time it has triggered), the cause of which has not yet been identified. The fix they rolled out is not to prevent the race condition from happening again, as they are not yet certain how the race occurs, but to handle it gracefully if it does. That way one payment would be declined, but subsequent payments would not be corrupted.

When the transformer lost the race on the 30th, it entered a state where it was corrupting every subsequent payment message that passed through it. From the transformer's perspective it was not failing, and there wasn't any validation that the output from the transformer was well formed. I believe there is some kind of load balancing between the transformers, but this transformer was not taken out of service automatically because it was considered healthy.
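
To illustrate the mitigation described above, here is a hedged sketch, again in Go with invented message formats (not the partner's real code), of validating the transformer's output and declining a single payment rather than silently forwarding corrupted messages:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// internalPayment is the (invented) format the transformer is supposed to emit.
type internalPayment struct {
	Reference   string `json:"reference"`
	AmountPence int64  `json:"amount_pence"`
	Destination string `json:"destination"`
}

// validateOutput is the kind of check the thread says was missing: it confirms
// the transformer's output is well formed before it goes any further.
func validateOutput(out []byte) error {
	var p internalPayment
	if err := json.Unmarshal(out, &p); err != nil {
		return fmt.Errorf("output is not well formed: %w", err)
	}
	if p.Reference == "" || p.AmountPence <= 0 || p.Destination == "" {
		return errors.New("output failed sanity checks")
	}
	return nil
}

// forwardPayment would hand the message to the next system; stubbed here.
func forwardPayment(out []byte) { fmt.Println("forwarded:", string(out)) }

// process declines a single payment when its transformed output is bad,
// instead of passing corrupted data downstream.
func process(transformed []byte) error {
	if err := validateOutput(transformed); err != nil {
		return fmt.Errorf("declining this payment: %w", err)
	}
	forwardPayment(transformed)
	return nil
}

func main() {
	good := []byte(`{"reference":"ABC123","amount_pence":1000,"destination":"04000412345678"}`)
	corrupted := []byte(`{"reference":"","amount_pence":-1,"destination":""}`)
	fmt.Println(process(good))      // <nil>
	fmt.Println(process(corrupted)) // declining this payment: output failed sanity checks
}
```

The design point is that a per-message check turns "corrupt everything downstream" into "decline one payment", which is the graceful behaviour the fix aims for.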

9 Likes

If only they had used Rust, then they wouldn't have to worry about race conditions :stuck_out_tongue:. Joking aside, I really enjoy reading these sorts of detailed articles, and this transparency is the reason I bank with you :slight_smile:

2 Likes

I really appreciate you being so transparent with your issues and also making it very clear how you’re putting your users first and solving the problem, thanks all!

2 Likes

This is where the 25% comes from: they had four instances and one was misbehaving. A load balancer can only route traffic away from an unhealthy instance if it knows it's unhealthy; in this case the corruption wasn't being detected, so the instance was never taken out of rotation.
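
As a hypothetical sketch (in Go) of why the bad instance stayed in rotation: a shallow liveness check only asks whether the process is up and responding, so a transformer that answers health checks while mangling its output still looks healthy to the load balancer.

```go
package main

import "fmt"

// instance models one of the four transformers behind the load balancer.
// "corrupting" is the failure mode from the incident: the process is up and
// responsive, but every message it transforms comes out mangled.
type instance struct {
	name       string
	corrupting bool
}

// liveness is a shallow health check: "is the process up and responding?".
// It says nothing about whether the instance's output is well formed.
func (i instance) liveness() bool {
	return true
}

func main() {
	pool := []instance{
		{name: "transformer-1"},
		{name: "transformer-2", corrupting: true},
		{name: "transformer-3"},
		{name: "transformer-4"},
	}
	for _, i := range pool {
		// The load balancer keeps routing to anything that passes liveness,
		// so roughly one in four payments lands on the corrupting instance.
		fmt.Printf("%s: passes health check=%v, corrupting output=%v\n",
			i.name, i.liveness(), i.corrupting)
	}
}
```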

My point was that the article said “When the program was operating under load, the system tried to clear memory it believed to be unused”. So why was that the only one under load? Why weren't they all under approximately the same load if there was a load balancer? And if they were, why was that the only one to fail?

From Richard's earlier comments I assume it was due to the unidentified but rare race condition, which, presumably, only one of the four instances triggered. Load sharing is fine, but it doesn't mean all four instances were in exactly the same state at any given time, so the trigger condition was presumably avoided by the others.

1 Like

More payments going through -> eventually more GC runs. My understanding is the race condition was a race against the GC. The more GC runs, the more opportunities the race had to occur. Each instance of this program was equally likely to experience it.

1 Like

What interests me is why they only restarted one of the pair of server sets in the identified location. I don't know whether a restart would have solved the issue in the other servers, but taking them down for a bit as part of the restart cycle would surely have positively identified them as the source of the problem?

1 Like

Must be fate: I've ended up spending the whole afternoon at work dealing with MT940 files. So maybe not that different a field after all, haha.

But that statistically cannot be: if all four instances are identical, and all are equally likely to have the required load, then the race should be approximately equally distributed once you have thousands of runs. Yet they said it only affected one.

Find out why that particular machine suffered and the others didn't, and you've found the cause of the race.

Every time you GC, roll a 20,000-sided die. If it lands on 1, start corrupting messages.

That instance of the program lost the roll when it GC'd, but the fact that it did didn't change the odds of any other instance getting unlucky. I don't believe the load between them was uneven.

(I’ve plucked 20k out of the air there, not sure what the actual odds were)
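
To put rough numbers on the die-roll analogy, here is a quick simulation in Go. The GC-run count and the 1-in-20,000 odds are made up (as acknowledged above), but it shows that with four identical instances and identical odds, "exactly one instance triggered" is an ordinary outcome rather than a statistical anomaly.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		instances = 4
		gcRuns    = 5000   // GC cycles per instance over the window (invented)
		odds      = 20000  // the "20,000-sided die" from the comment above (also invented)
		trials    = 100000 // simulated incident windows
	)
	counts := make([]int, instances+1) // counts[k] = windows in which exactly k instances triggered
	r := rand.New(rand.NewSource(1))
	for t := 0; t < trials; t++ {
		triggered := 0
		for i := 0; i < instances; i++ {
			for g := 0; g < gcRuns; g++ {
				if r.Intn(odds) == 0 { // this instance "lost the roll"
					triggered++
					break
				}
			}
		}
		counts[triggered]++
	}
	for k, c := range counts {
		fmt.Printf("%d instance(s) triggered in %.1f%% of windows\n", k, 100*float64(c)/float64(trials))
	}
}
```

With these made-up numbers, each instance has roughly a 22% chance of triggering over the window, so "exactly one out of four" is actually the most likely non-zero outcome.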