We had problems with bank transfers on 30th May. Here's what happened and how we're fixing it for the future

But if the die is 20k-sided and you have 10 million runs, all will trigger… yes, it’s an odds game, but with over 2 million customers, how many runs do you have per second per instance?
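To put rough numbers on that, here is a minimal sketch. It assumes every run is an independent roll with a 1-in-20,000 chance of triggering, which is a simplification; the figures are the ones quoted above and the function names are made up for illustration.

```python
import math


def expected_hits(sides: int, runs: int) -> float:
    """Average number of times one particular face comes up in `runs` rolls."""
    return runs / sides


def prob_at_least_once(sides: int, runs: int) -> float:
    """Chance that one particular face comes up at least once.

    This is 1 - (1 - 1/sides) ** runs, written with log1p/expm1 so it
    stays accurate when 1/sides is tiny and runs is huge.
    """
    return -math.expm1(runs * math.log1p(-1.0 / sides))


if __name__ == "__main__":
    sides, runs = 20_000, 10_000_000  # figures from the post above
    print(expected_hits(sides, runs))
    print(prob_at_least_once(sides, runs))
```

With those figures the expected number of hits is 500, so under these assumptions seeing at least one is all but certain. How quickly a real system gets through 10 million runs is exactly the open question in the post.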

Good to see information sharing. People have come to expect near-instant transfers with Faster Payments, even though the only guarantee is that funds will be credited by the end of the next business day.

No load balancer balances load totally evenly

No server runs at exactly the same speed as another server, even from the same batch

I have seen this sort of thing, very rarely, in the embedded real-time software we work on, and it is the worst sort of thing to find and fix

I’m in mental health care, so the content of this thread is totally alien to me, and not surprisingly, makes no real sense…

So why does it make such fascinating reading, that’s what I’d like to know :thinking::flushed:

Because it feels a bit like magic?

It is all too often like that even when you do work with it!

Please do ask any questions! Perhaps we can edit the post to clarify for others too :slightly_smiling_face:

Nothing wrong with the initial post, @nickrw. That hits the spot.

The subsequent conversation between those in the know - well that defies laymen’s understanding. And so it probably should.

I’ll just continue to read it and pretend I understand…:grin:

I must admit, I’m also surprised that they’re not validating and range-checking the data they’re sending out of their systems. I can perhaps understand it internally, but when crossing the boundary to a bank or to the hub I would have expected some checks.

Checking that the thing that is supposed to be a date in a certain format looks like a date in that format is a good idea!

(Now trying desperately to remember what we do on aircraft interfaces in case I’m ‘dropping myself in it’!)
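For what it's worth, a boundary check like that can be tiny. This is only a sketch under assumptions: the thread doesn't say what format the gateway actually uses, so the `%Y-%m-%d` format, the range limits and the function name here are all hypothetical.

```python
from datetime import date, datetime


def parse_outbound_date(
    value: str,
    fmt: str = "%Y-%m-%d",             # hypothetical wire format
    earliest: date = date(2000, 1, 1),  # hypothetical range limits
    latest: date = date(2100, 1, 1),
) -> date:
    """Return the parsed date, or raise ValueError if it looks wrong."""
    try:
        parsed = datetime.strptime(value, fmt).date()
    except (TypeError, ValueError) as exc:
        raise ValueError(f"not a date in format {fmt!r}: {value!r}") from exc

    # strptime tolerates unpadded fields such as "2019-5-30", so format
    # the result back and insist on an exact match with the input.
    if parsed.strftime(fmt) != value:
        raise ValueError(f"not exactly in format {fmt!r}: {value!r}")

    # Range check: a well-formed date can still be nonsense.
    if not (earliest <= parsed <= latest):
        raise ValueError(f"date out of expected range: {value!r}")
    return parsed
```

Something like `parse_outbound_date("2019-05-30")` passes, while `"30/05/2019"`, `"2019-5-30"` or `"2019-02-30"` raise before anything leaves the system. Whether a check like this fits in a tight timing budget is a separate question.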

It’s not always feasible to add validation to every field.

Or you add filters to one process and it introduces other issues which were not taken into account.

True. Or simply screws up the timing if it’s on a tight schedule.

Use of the term “unsafe” in the post implies to me they have tried to optimise C# code with unsafe blocks - in general a bad idea.

True, I’m not arguing with that at all. My point remains - work out why THAT server lost the race and it will potentially tell you what the problem is.

Well done! This is an excellent write-up that everyone can understand. Things do go wrong sometimes and transparency is everything. Great work :heart_eyes:

Can only speak from my personal experience, but I have found that quest to be the path to insanity

We tend to replace the area in question so that no sort of problem like that can occur to any node at any time

The old adage about not prematurely optimising (to which I would add redesigning) still holds though

This sort of thing is a trigger for us to fully re-evaluate an area, maybe in toto

If you don’t drill down to the root cause, how do you know you’ve fixed it rather than simply shifted it to another set of conditions?

Maybe I have been doing this too long and in too niche a field, but you cannot find and fix the root cause every single time

I’ve had to cope with maybe a dozen or more really buried issues, and through persistence, stubbornness and luck I’ve addressed pretty much all of them with laser-focused fixes

However, along the way we’ve improved so much legacy code to be more defensive and better structured that the value is really there

And of course don’t do anything really stupid like running “unsafe” code when using a GC language

Find a better way to speed it up

Updated my response above now that I’m on my train and have had cause to reflect

I agree you should always try to find the root cause, but what I was trying to get at was you cannot always fix it

Maybe it is in another application, another hardware device, another building, another company even

Sometimes burying the problem under a mountain of love is the best you can do barring something more radical

Cheers for all the info on this case, great that you’ve thoroughly explained what happened. It came at a bad time for me, but I still trust Monzo after seeing you deal with it so well.

I had the same thought as soon as I saw the bit about the gateway’s transformer using an unsafe method to access memory! Definitely wouldn’t be an issue in Rust :grin: