We had problems with bank transfers on 30th May. Here's what happened and how we're fixing it for the future

On the 30th of May, the company we use to connect to Faster Payments had a technical issue, which meant a quarter of bank transfers into Monzo accounts were failing or delayed by several hours, and bank transfers from Monzo accounts were delayed by a few minutes.

There was an issue with a system that this company uses to translate payment messages from one internal format to another (they call this a “transformer”). This issue corrupted payment messages, which meant we couldn’t process bank transfers properly.
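
To make the “transformer” idea a bit more concrete, here is a minimal sketch, in Go, of a component that translates a payment message from one internal format to another. It is purely illustrative: the struct names and fields are invented, and this is not our partner’s actual code.

```go
package main

import "fmt"

// These two structs stand in for the "one internal format" and the "other"
// mentioned above. The real message formats aren't public, so every field
// here is invented for illustration.
type inboundPayment struct {
	Ref         string
	AmountPence int64
	SortCode    string
	Account     string
}

type internalPayment struct {
	Reference   string
	AmountPence int64
	Destination string // sort code and account number combined
}

// transform plays the role of the partner's "transformer": it translates a
// message from the inbound format into the internal one.
func transform(in inboundPayment) internalPayment {
	return internalPayment{
		Reference:   in.Ref,
		AmountPence: in.AmountPence,
		Destination: in.SortCode + in.Account,
	}
}

func main() {
	in := inboundPayment{Ref: "ABC123", AmountPence: 1000, SortCode: "040004", Account: "12345678"}
	fmt.Printf("%+v\n", transform(in))
}
```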

One of our payments engineers, @nickrw, has given a high-level overview and broken down in technical detail what went wrong, as well as sharing our plans to make sure it doesn't happen again.

Again, we’re really sorry about what went wrong here. And rest assured we’re committed to fixing it.

39 Likes

Appreciate the transparency with these posts. Also, as a software engineer in a very different field, they always make a fascinating read.

13 Likes

This is why I bank with you guys. Open and completely honest. Thanks for the breakdown.

11 Likes

Excellent write up, enjoyed reading it.

Please do more articles on how things work; it doesn't have to just be when something goes wrong.

10 Likes

Excellent write up. As ever, I'm pleased to be a customer of, and investor in, a bank that explains these issues clearly and in full detail.

3 Likes

Really appreciate reading this. I wasn't affected, but that's a fantastic insight into your ITSM and particularly your incident and problem management processes. I know some people were unhappy about the timescales for fixing this, but complex incidents can take time to resolve, particularly when you're working with third-party service providers. This level of detail and transparency from a bank is great.

2 Likes

Interesting. Why did only one transformer fail? Wouldn’t there be a load balancer in front of them?

2 Likes

Great article describing the problem in detail and how you are going to try to ensure it doesn't happen again. This is the sort of response you want from a bank when they have an outage!

2 Likes

As a complete layperson, and someone who wasn’t affected by the problem, this was an interesting read. Thank you.

4 Likes

My understanding of our partner's infrastructure is incomplete, but I do know the transformer code has an extremely rare race condition (this is the first time it has triggered), the cause of which has not yet been identified. The fix they rolled out is not to prevent the race condition from happening again, as they are not yet certain how the race occurs, but to handle it gracefully if it does. That way one payment would be declined, but subsequent payments would not be corrupted.

When the transformer lost the race on the 30th, it entered a state where it was corrupting every subsequent payment message that passed through it. From the transformer's perspective it was not failing, and there wasn't any validation that the output from the transformer was well formed. I believe there is some kind of load balancing between the transformers, but this transformer was not taken out of service automatically because it was considered healthy.
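
To illustrate the mitigation described above, here is a hedged sketch, again in Go with invented message formats (not the partner's real code), of validating the transformer's output and declining a single payment rather than silently forwarding corrupted messages:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// internalPayment is the (invented) format the transformer is supposed to emit.
type internalPayment struct {
	Reference   string `json:"reference"`
	AmountPence int64  `json:"amount_pence"`
	Destination string `json:"destination"`
}

// validateOutput is the kind of check the thread says was missing: it confirms
// the transformer's output is well formed before it goes any further.
func validateOutput(out []byte) error {
	var p internalPayment
	if err := json.Unmarshal(out, &p); err != nil {
		return fmt.Errorf("output is not well formed: %w", err)
	}
	if p.Reference == "" || p.AmountPence <= 0 || p.Destination == "" {
		return errors.New("output failed sanity checks")
	}
	return nil
}

// forwardPayment would hand the message to the next system; stubbed here.
func forwardPayment(out []byte) { fmt.Println("forwarded:", string(out)) }

// process declines a single payment when its transformed output is bad,
// instead of passing corrupted data downstream.
func process(transformed []byte) error {
	if err := validateOutput(transformed); err != nil {
		return fmt.Errorf("declining this payment: %w", err)
	}
	forwardPayment(transformed)
	return nil
}

func main() {
	good := []byte(`{"reference":"ABC123","amount_pence":1000,"destination":"04000412345678"}`)
	corrupted := []byte(`{"reference":"","amount_pence":-1,"destination":""}`)
	fmt.Println(process(good))      // <nil>
	fmt.Println(process(corrupted)) // declining this payment: output failed sanity checks
}
```

The design point is that a per-message check turns "corrupt everything downstream" into "decline one payment", which is the graceful behaviour the fix aims for.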

9 Likes

If only they had used Rust, then they wouldn't have to worry about race conditions :stuck_out_tongue:. Joking aside, I really enjoy reading these sorts of detailed articles, and this transparency is the reason I bank with you :slight_smile:

2 Likes

I really appreciate you being so transparent with your issues and also making it very clear how you’re putting your users first and solving the problem, thanks all!

2 Likes

This is where the 25% comes from: they had four instances and one was misbehaving. A load balancer can only route traffic away from an unhealthy instance if it knows it's unhealthy; in this case the corruption wasn't being detected, so the instance was never taken out of rotation.
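
As a hypothetical sketch (in Go) of why the bad instance stayed in rotation: a shallow liveness check only asks whether the process is up and responding, so a transformer that answers health checks while mangling its output still looks healthy to the load balancer.

```go
package main

import "fmt"

// instance models one of the four transformers behind the load balancer.
// "corrupting" is the failure mode from the incident: the process is up and
// responsive, but every message it transforms comes out mangled.
type instance struct {
	name       string
	corrupting bool
}

// liveness is a shallow health check: "is the process up and responding?".
// It says nothing about whether the instance's output is well formed.
func (i instance) liveness() bool {
	return true
}

func main() {
	pool := []instance{
		{name: "transformer-1"},
		{name: "transformer-2", corrupting: true},
		{name: "transformer-3"},
		{name: "transformer-4"},
	}
	for _, i := range pool {
		// The load balancer keeps routing to anything that passes liveness,
		// so roughly one in four payments lands on the corrupting instance.
		fmt.Printf("%s: passes health check=%v, corrupting output=%v\n",
			i.name, i.liveness(), i.corrupting)
	}
}
```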

My point was that the article said “When the program was operating under load, the system tried to clear memory it believed to be unused”. So why was that the only one under load? Why weren't they all under approximately the same load if there was a load balancer? And if they were, why was that the only one to fail?

From Richard's earlier comments I assume it was due to the unidentified but rare race condition, which, presumably, only one of the four instances triggered. Load sharing is fine, but it doesn't mean all four instances were in exactly the same state at any given time, so the trigger condition was presumably avoided by the others.

1 Like

More payments going through -> eventually more GC runs. My understanding is the race condition was a race against the GC. The more GC runs, the more opportunities the race had to occur. Each instance of this program was equally likely to experience it.

1 Like

What interests me is why they only restarted one of the pair of server sets in the identified location. I don't know whether a restart would have solved the issue in the other servers, but taking them down for a bit as part of the restart cycle would surely have positively identified them as the source of the problem?

1 Like

Must be fate: I've ended up spending the whole afternoon at work dealing with MT940 files. So maybe not that different a field after all, haha.

But that statistically cannot be: if all four instances are identical, and all are equally likely to have the required load, then the race should be approximately equally distributed once you have thousands of runs. Yet they said it only affected one.

Find out why that particular machine suffered and the others didn't, and you've found the cause of the race.

Every time you GC, roll a 20,000-sided die. If it lands on 1, start corrupting messages.

That instance of the program lost the roll when it GC'd, but the fact that it did didn't change the odds of any other instance getting unlucky. I don't believe the load between them was uneven.

(I’ve plucked 20k out of the air there, not sure what the actual odds were)
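
To put rough numbers on the die-roll analogy, here is a quick simulation in Go. The GC-run count and the 1-in-20,000 odds are made up (as acknowledged above), but it shows that with four identical instances and identical odds, "exactly one instance triggered" is an ordinary outcome rather than a statistical anomaly.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		instances = 4
		gcRuns    = 5000   // GC cycles per instance over the window (invented)
		odds      = 20000  // the "20,000-sided die" from the comment above (also invented)
		trials    = 100000 // simulated incident windows
	)
	counts := make([]int, instances+1) // counts[k] = windows in which exactly k instances triggered
	r := rand.New(rand.NewSource(1))
	for t := 0; t < trials; t++ {
		triggered := 0
		for i := 0; i < instances; i++ {
			for g := 0; g < gcRuns; g++ {
				if r.Intn(odds) == 0 { // this instance "lost the roll"
					triggered++
					break
				}
			}
		}
		counts[triggered]++
	}
	for k, c := range counts {
		fmt.Printf("%d instance(s) triggered in %.1f%% of windows\n", k, 100*float64(c)/float64(trials))
	}
}
```

With these made-up numbers, each instance has roughly a 22% chance of triggering over the window, so "exactly one out of four" is actually the most likely non-zero outcome.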