Just a quick FYI in case anyone misses the in-app notification: I’ve just had the below…
Just raised a feedback idea for this.
I wonder if this is them moving over to their own connection rather than using a third party?
Pretty certain that it is, given that it has been tested and November was mentioned as the window.
I am sure that @nickrw can confirm
but shhhh, you saw nothing…
Yep! This is our migration to our own FPS gateway, send us good vibes and coffee for that 6AM start
For anyone who doesn’t know, we committed to building our own gateway due to some outages with a 3rd party a few months ago. If anyone has any questions, technical or otherwise, I’m happy to answer them.
If the change to using your own internal card processor is anything to go by this will be a great step forward.
If anyone’s wondering what the ramifications of this are: what we learned from the 3rd party outages a few months ago was that we need to be fully in control of this aspect of our infrastructure, either to avoid that happening in the future or to be able to identify an issue instantly and fix it ourselves. Part of the problem with those outages was that it took time for the 3rd party to identify the issue.
I’ve pinned this topic globally to the top of the forum so that more people will see it.
Just the one @LewiLewiLewi asked on Twitter:
What happened at 15:12?
I find that where I work, things often get lost in translation between parties or the information is not transmitted fast enough.
If we had all the information at our fingertips rather than going through third parties, we’d discover the root cause much faster.
Here’s a question. Do you have specific requirements for redundancy? (e.g. running multiple instances in case of failure, running in multiple physical locations in case of service loss etc?)
Are such requirements owned by Monzo or externally applied by some sort of industry body?
I’m assuming the answer is yes, and that this has to cover hardware collapse as well as software going bonkers, which may imply different physical infrastructure, but I’m interested to know ‘for real’.
Payment cannot be applied because of beneficiary sensitivities
I really want to know what that means
I’ve replied to that tweet now, but the short answer is we turned off one of the datacentres running our gateway, while running a load test through it. In this scenario all the traffic should be taken up by the other datacentre.
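To make that failover test concrete, here is a toy Python sketch of the scenario: traffic is spread across two datacentres, one is switched off part-way through a load test, and the other should take up all the traffic with nothing dropped. The `Datacentre` class, the `run_load_test` function, the names `dc-a` and `dc-b`, and the round-robin routing are all assumptions made for illustration, not a description of the real gateway.

```python
from dataclasses import dataclass


@dataclass
class Datacentre:
    """Hypothetical stand-in for one datacentre running gateway instances."""
    name: str
    healthy: bool = True
    handled: int = 0


def run_load_test(total_messages, datacentres, fail_at, fail_name):
    """Send messages round-robin across healthy datacentres.

    At message number `fail_at`, the datacentre called `fail_name` is
    switched off. Returns the number of messages that could not be routed.
    """
    dropped = 0
    for message_id in range(total_messages):
        if message_id == fail_at:
            # Simulate turning one datacentre off in the middle of the test.
            for dc in datacentres:
                if dc.name == fail_name:
                    dc.healthy = False

        healthy = [dc for dc in datacentres if dc.healthy]
        if not healthy:
            dropped += 1
            continue

        # Spread traffic evenly across whatever is still up.
        target = healthy[message_id % len(healthy)]
        target.handled += 1
    return dropped


if __name__ == "__main__":
    dcs = [Datacentre("dc-a"), Datacentre("dc-b")]
    dropped = run_load_test(1000, dcs, fail_at=500, fail_name="dc-a")
    print("dropped:", dropped)
    for dc in dcs:
        print(dc.name, "handled", dc.handled)
```

In this sketch the surviving datacentre simply absorbs the rest of the traffic. A real test would also need to watch capacity and latency on the remaining side, which this toy model ignores.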
Why can’t the downtime happen at a time when most people are asleep?
(Just curious)
Yep, geographically distinct datacentres through which our connections to FPS arrive, multiple instances of our gateway application on separate hardware in each datacentre, redundant links from the datacentres to AWS, etc.
A mix of both. Pay.UK (the scheme operator for FPS) does have minimum requirements that you have to prove as part of the certification process. We had to run several multi-hour load tests connected to a test environment they provided and prove we can survive different failure scenarios at our 5-year projected peak load without dropping messages or increasing latency. They’re fairly basic though, things like “turn one DC off”, “the central system suffers a partial failure reducing your number of connections to it”.
We passed certification last month, and have been spending the time since then tidying things up and doing our own, more specific, failure testing.
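As a rough illustration of the pass criteria mentioned above (no dropped messages and no increase in latency under a failure scenario), a check over load test results might look like the Python sketch below. The function names, the use of 99th percentile latency, and the 10% `tolerance` are all assumptions for the sake of the example; the actual certification thresholds are not stated in this thread.

```python
from statistics import quantiles


def p99(latencies_ms):
    """99th percentile of a list of latency samples, in milliseconds."""
    # quantiles(..., n=100) returns 99 cut points; the last is the 99th percentile.
    return quantiles(latencies_ms, n=100)[98]


def passes_failure_scenario(sent, acknowledged, baseline_ms, scenario_ms,
                            tolerance=1.10):
    """Return True if a failure scenario meets both criteria.

    sent / acknowledged: message counts during the scenario.
    baseline_ms: latency samples from a run with nothing switched off.
    scenario_ms: latency samples from the run with the failure injected.
    tolerance: assumed allowance for measurement noise (hypothetical value).
    """
    no_drops = acknowledged == sent
    latency_ok = p99(scenario_ms) <= p99(baseline_ms) * tolerance
    return no_drops and latency_ok
```

A scenario such as “turn one DC off” would then be run once for the baseline and once with the failure injected, with both sets of samples passed to `passes_failure_scenario`.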
Very boring question, but what do you use to create your dashboards? We have tried a few solutions and I have never found one I liked.
Good question! A few different things fed into the decision on when we do it:
- Standing orders are typically still being sent at that time, so we don’t want to migrate between 1-3am
- How many payments are typically sent at various times of day on Saturdays
- How long the maintenance will actually take: we’re not expecting it to take 3 hours, but have that long available just in case
- The alertness of our team performing the migration in the early hours of the morning
- How many customer support staff we have available
We found 6am to be the best trade-off of all these things.
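The factors listed above can be sketched as a simple scoring exercise. This is only one way to frame such a trade-off, not the method actually used: the `Candidate` fields, the equal `weights`, and the scoring formula are all hypothetical. The 1-3am standing order period and the 3-hour allowance come from the post above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start_hour: int           # start of the window, 24-hour clock
    payments_per_hour: float  # typical Saturday payment volume in this slot
    team_alertness: float     # 0.0 (asleep) to 1.0 (fully alert)
    support_staff: int        # customer support staff available


def overlaps(start, duration, blocked_start, blocked_end):
    """True if a window of `duration` hours from `start` hits the blocked period."""
    end = start + duration
    # Check the blocked period today and tomorrow, so that a window
    # crossing midnight (e.g. 23:00 to 02:00) is caught as well.
    return any(start < blocked_end + shift and end > blocked_start + shift
               for shift in (0, 24))


def best_window(candidates, duration_hours=3, blocked=(1, 3),
                weights=(1.0, 1.0, 1.0)):
    """Pick the highest-scoring candidate that avoids the blocked period."""
    w_volume, w_alert, w_support = weights
    allowed = [c for c in candidates
               if not overlaps(c.start_hour, duration_hours, *blocked)]
    if not allowed:
        return None

    # Normalise so that each factor contributes on a 0..1 scale.
    max_volume = max(c.payments_per_hour for c in allowed) or 1.0
    max_staff = max(c.support_staff for c in allowed) or 1

    def score(c):
        return (-w_volume * c.payments_per_hour / max_volume  # fewer payments is better
                + w_alert * c.team_alertness
                + w_support * c.support_staff / max_staff)

    return max(allowed, key=score)
```

Any real decision would weigh these factors with judgement rather than a formula; the sketch only shows how excluding the standing order period and balancing the remaining factors fit together.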
This is Grafana, backing on to Prometheus.
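For anyone curious how that pairing fits together: Grafana panels are driven by PromQL queries, and the same queries can be sent straight to Prometheus through its HTTP API at `/api/v1/query`. The Python sketch below only builds such a request URL. The metric name `gateway_request_duration_seconds_bucket` and the `localhost:9090` address are hypothetical placeholders, not the real dashboard configuration.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical metric name: 99th percentile request latency over 5 minutes.
P99_LATENCY = (
    "histogram_quantile(0.99, "
    "sum(rate(gateway_request_duration_seconds_bucket[5m])) by (le))"
)

if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", P99_LATENCY))
```

Pasting the same PromQL expression into a Grafana panel backed by a Prometheus data source would chart that value over time.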