We had problems with bank transfers on 30th May. Here's what happened and how we're fixing it for the future

DaveTMG · 20 June 2019 15:01

But if the die is 20k sided and you have 10 million runs, all will trigger… yes it’s an odds game, but with over 2 million customers, how many runs do you have per second per instance?
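As a rough check of the arithmetic in that point, here is a minimal sketch that assumes each run is an independent roll with a 1-in-`sides` chance of hitting the bug. The function names are hypothetical, and the independence assumption is a simplification that real request traffic may not satisfy.

```python
import math


def p_at_least_one(sides: int, runs: int) -> float:
    """Chance that a 1-in-`sides` event happens at least once in `runs` independent tries."""
    # 1 - (1 - 1/sides)**runs, computed with log1p/expm1 so tiny probabilities stay accurate.
    return -math.expm1(runs * math.log1p(-1.0 / sides))


def expected_hits(sides: int, runs: int) -> float:
    """Average number of times the event fires over `runs` independent tries."""
    return runs / sides


if __name__ == "__main__":
    # Figures from the post: a 20k-sided die and 10 million runs.
    print(expected_hits(20_000, 10_000_000))
    print(p_at_least_one(20_000, 10_000_000))
```

Under these assumptions the event is expected about 500 times across 10 million runs, so seeing it at least once is a near certainty.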

Drew58 · 20 June 2019 15:28

Good to see information sharing. People have come to expect near-instant transfers with Faster Payments, even though the only guarantee is that funds will be credited by the end of the next business day.

SouthseaOne · 20 June 2019 16:08

No load balancer balances load totally evenly

No server runs at the exact same speed as another server, even from the same batch

I have seen this sort of thing with the embedded real time software we work on very rarely and it is the worst sort of thing to find and fix

Demmedelusive · 20 June 2019 16:10

I’m in mental health care, so the content of this thread is totally alien to me, and not surprisingly, makes no real sense…

So why does it make such fascinating reading? That’s what I’d like to know

SouthseaOne · 20 June 2019 16:14

Because it feels a bit like magic?

It is all too often like that even when you do work with it!

anon41613057 · 20 June 2019 16:19

Please do ask any questions! Perhaps we can edit the post to clarify for others too

Demmedelusive · 20 June 2019 16:25

Nothing wrong with the initial post, @nickrw. That hits the spot.

The subsequent conversation between those in the know - well that defies laymen’s understanding. And so it probably should.

I’ll just continue to read it and pretend I understand…

Feathers · 20 June 2019 17:28

I must admit, I’m also surprised that they’re not validating and range checking the data they’re putting out of their systems. I can perhaps understand it internally but when crossing the boundary to a bank or to the hub I might have expected some checks.

Checking that the thing that is supposed to be a date in a certain format looks like a date in that format is a good idea!

(Now trying desperately to remember what we do on aircraft interfaces in case I’m ‘dropping myself in it’!)
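A boundary check of the kind described above could look something like the following minimal sketch. Everything specific in it is an assumption for illustration: the `YYYY-MM-DD` format, the function name `validate_outbound_date`, and the one-day tolerance are not taken from Monzo's systems or from the Faster Payments message format.

```python
import re
from datetime import date, datetime

# Assumed wire format for this sketch: exactly YYYY-MM-DD with zero padding.
_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_outbound_date(value: str, today: date, max_skew_days: int = 1) -> date:
    """Check that an outgoing date field looks like a date and is plausibly recent."""
    # Format check: reject anything that is not the exact expected shape.
    if not _DATE_SHAPE.fullmatch(value):
        raise ValueError(f"not in YYYY-MM-DD format: {value!r}")
    # Calendar check: strptime rejects impossible dates such as month 13.
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    # Range check: a payment date far from today is probably corrupt.
    if abs((parsed - today).days) > max_skew_days:
        raise ValueError(f"date {parsed} is more than {max_skew_days} day(s) from {today}")
    return parsed
```

A caller would run this just before a message leaves the system and refuse to send anything that raises, which is where the timing and maintenance costs raised in the replies come in.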

walderston · 20 June 2019 17:45

It’s not always feasible to add validation to every field.

Or you add filters to one process and it introduces other issues which were not taken into account.

Feathers · 20 June 2019 17:51

True. Or simply screws up the timing if it’s on a tight schedule.

rbirkby · 20 June 2019 21:26

Use of the term “unsafe” in the post implies to me they have tried to optimise C# code with unsafe blocks - in general a bad idea.

DaveTMG · 21 June 2019 05:48

True, I’m not arguing with that at all. My point remains - work out why THAT server lost the race and it will potentially tell you what the problem is.

anon79863048 · 21 June 2019 07:29

Well done! This is an excellent write-up that everyone can understand. Things do go wrong sometimes and transparency is everything. Great work

SouthseaOne · 21 June 2019 08:52

Can only talk to my personal experience, but I have found that quest to be the path to insanity

We tend to replace the area in question so that no sort of problem like that can occur to any node at any time

The old adage about not prematurely optimising (to which I would add redesigning) still holds though

This sort of thing is a trigger for us to fully re-evaluate an area, maybe in toto

DaveTMG · 21 June 2019 08:53

If you don’t drill down to the root cause, how do you know you’ve fixed it rather than simply shifted it to another set of conditions?

SouthseaOne · 21 June 2019 08:58

Maybe I have been doing this too long and in too niche a field, but you cannot find and fix the root cause every single time

I’ve had to cope with maybe a dozen plus really buried issues and through persistence, stubbornness and luck I’ve addressed pretty much all of them with laser focused fixes

However along the way we’ve improved so much legacy code throughout to be more defensive and better structured that the value is really there

SouthseaOne · 21 June 2019 08:59

And of course don’t do anything really stupid like running “unsafe” code when using a GC language

Find a better way to speed it up

SouthseaOne · 21 June 2019 09:05

Updated my response above now I’m on my train and had cause to reflect

I agree you should always try to find the root cause, but what I was trying to get at was you cannot always fix it

Maybe it is in another application, another hardware device, another building, another company even

Sometimes burying the problem under a mountain of love is the best you can do barring something more radical

Abroadley007 · 22 June 2019 22:20

Cheers for all the info on this case, great that you’ve thoroughly explained what happened. Came at a bad time for me, but I still have trust in Monzo after dealing with it so well.

blahaj · 29 June 2019 07:17

I had the same thought as soon as I saw the bit about the gateway’s transformer using an unsafe method to access memory! Definitely wouldn’t be an issue in Rust
