Great question! Our priority is making sure users have access to their money. An ‘incident’ can range from feed items not being created in real time to a full-blown outage where all payments decline. We have a few things we can do internally to minimise user impact, depending on the severity of the incident. What we can do also depends very much on whether it’s an internal incident or one caused by another company that we interact with (like Mastercard or other banks).
Ultimately, if there’s something impacting our whole user base, and they can’t access their money for whatever reason, we would consider sending a message with a push notification to all users letting them know what’s happening, so they’re not caught off guard. It might seem like overkill to many users, but some people will be sitting in a restaurant unable to pay their bill, or at the supermarket till with a trolley full of groceries (and might not even have food at home). So the first line of action is being transparent about what’s happening and keeping users in the loop while the engineering teams resolve the issue.
Internally, we would immediately start publishing overtime shifts that anyone can pick up to jump straight into the conversation queues, both for that day and for the following days if we expect the issue to last longer. As a customer-centric company, everyone at Monzo is also trained on how to use our internal tools and answer customer queries. So if things are really bad, we’d ask other teams to jump on the customer support queue.
In less serious incidents we have other tools we’d use to minimise user impact, or COps impact (from the additional influx of queries). Something very straightforward to do is adding new content to the help screen. ‘Emergency’ content appears at the top of the suggested articles (below the search bar), at the top of the search results, and as one of the ‘suggestions’ on the screen where users first type their message. This is very effective at answering questions about issues that do not impact users’ money, but might create confusion in the app for some reason (like a duplicate feed item, or a slightly delayed bank transfer). These are issues where our COps team can take no action beyond explaining what’s happening, which the help screen is also effective at doing.
We also use StatusPage to publish incidents. If we add an incident on StatusPage, users who signed up for SMS notifications on their website are notified (although that’s a very small number of people). StatusPage also allows us to set a severity, and depending on what we choose it can automatically add an alert banner throughout the Monzo app with a description of what’s happening, and enable an alert box that appears to anyone who tries to contact us during the incident (to tell users we’re aware of the issue and they don’t need to get in touch, although they still can).
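To make the severity idea a bit more concrete, here is a rough sketch of how a chosen severity could map to the actions described above. It is purely illustrative: the severity names, the threshold, and every identifier in it are assumptions made up for this example, not the actual StatusPage or Monzo setup.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the real names and ordering may differ.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class IncidentResponse:
    notify_sms_subscribers: bool  # people who opted in to SMS updates
    show_app_banner: bool         # alert banner throughout the app
    show_contact_alert: bool      # alert box shown before contacting support


# Assumed cut-off at which the in-app banner and contact alert switch on.
IN_APP_THRESHOLD = Severity.MAJOR


def plan_response(severity: Severity) -> IncidentResponse:
    """Decide which user-facing actions a published incident triggers."""
    in_app = severity >= IN_APP_THRESHOLD
    return IncidentResponse(
        notify_sms_subscribers=True,  # any published incident notifies subscribers
        show_app_banner=in_app,
        show_contact_alert=in_app,
    )
```

With these assumed levels, `plan_response(Severity.MINOR)` would only notify SMS subscribers, while `plan_response(Severity.CRITICAL)` would also turn on the banner and the contact alert.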
What’s really interesting is that after each incident we do a “post-mortem” to discuss how we could have dealt with it better. More often than not we realise we could change our systems to stop some of the issues before they affect all of our users. As we do more of these, we become better at preventing large-scale incidents and at minimising user and COps impact.