Hi everyone. I’m a backend engineer at Monzo, and I’m here to explain what happened last week, when we experienced problems that affected both prepaid and current accounts for around 30 minutes. As with the last major incident we experienced, I’d like to cover the causes of this outage, the timeline, and the steps we’ve taken to prevent similar events from happening again.
Unlike the outage in October 2017, in which our entire platform was unavailable, the outage last week was more localised. @oliver, our Head of Engineering, has previously posted an overview of our architecture on our blog (which is a great read), but here’s a rundown of the parts that were relevant to last week’s outage.
Our microservice architecture requires us to pass messages between different services (of which we have over 400). A message can represent something as simple as “send an email to a customer” or as complex as “apply this Direct Debit instruction to an account”. Within our system, we pass messages in several different ways. For time-sensitive messages, we might make a synchronous RPC directly to the service. For asynchronous messages, which are less time-dependent, we use two different message-passing systems, NSQ and Kafka. The two systems have different use cases and characteristics. This outage was caused by our use of NSQ.
When a service publishes messages to NSQ it only waits for an acknowledgement that the message was sent to NSQ successfully. The service doesn’t need confirmation that the message has been consumed (received) by another service, and in fact this may happen much later – this is why it is “asynchronous.” If messages are being sent faster than the downstream services can consume them, they are queued and consumed over a longer period of time.
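To make the queueing behaviour concrete, here is a minimal sketch using a toy in-memory queue rather than NSQ itself. The `ToyQueue` class, the `simulate` function and the rates used are all hypothetical; the only point is that a publish is acknowledged as soon as the message is stored, and that the queue depth grows whenever publishing outpaces consumption.

```python
from collections import deque


class ToyQueue:
    """Toy stand-in for a message queue (hypothetical, not the NSQ API)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        # The publisher only learns that the message was stored,
        # not that any consumer has processed it.
        self._messages.append(message)
        return True

    def consume(self, max_messages):
        # Hand over at most max_messages, oldest first.
        taken = []
        while self._messages and len(taken) < max_messages:
            taken.append(self._messages.popleft())
        return taken

    def depth(self):
        return len(self._messages)


def simulate(publish_per_tick, consume_per_tick, ticks):
    """Return the queue depth after each tick for fixed publish/consume rates."""
    queue = ToyQueue()
    depths = []
    for tick in range(ticks):
        for i in range(publish_per_tick):
            queue.publish((tick, i))
        queue.consume(consume_per_tick)
        depths.append(queue.depth())
    return depths
```

With these made-up rates, `simulate(5, 3, 4)` shows the backlog growing by two messages every tick, while `simulate(3, 5, 3)` stays empty because the consumer keeps up.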
Most messages are multicast to more than one service. Different services might process the same message in different ways and over different periods of time. We can also pause message processing for a particular service. For example, if we notice an issue with a service we can stop it from processing any more messages until we fix it.
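The multicast-and-pause behaviour described above can be sketched with a small in-memory model. This is an illustration under stated assumptions, not our production code or the NSQ client API: the `Topic` and `Channel` classes and the channel names are hypothetical. Every channel receives its own copy of each message, and a paused channel keeps accumulating a backlog until it is unpaused.

```python
from collections import deque


class Channel:
    """One consuming service's view of a topic (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.paused = False
        self.backlog = deque()

    def deliver(self, handler):
        # A paused channel delivers nothing; its backlog is left untouched.
        if self.paused:
            return 0
        delivered = 0
        while self.backlog:
            handler(self.backlog.popleft())
            delivered += 1
        return delivered


class Topic:
    """A stream of messages that is copied to every attached channel."""

    def __init__(self, name):
        self.name = name
        self.channels = {}

    def add_channel(self, name):
        if name not in self.channels:
            self.channels[name] = Channel(name)
        return self.channels[name]

    def publish(self, message):
        # Multicast: each channel gets its own copy, paused or not.
        for channel in self.channels.values():
            channel.backlog.append(message)
```

In this model, pausing a hypothetical "analytics" channel leaves a "feed" channel on the same topic unaffected, and the paused channel catches up on everything it missed once it is unpaused.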
It is worth noting that while NSQ is an important part of our message-passing system, it is not a critical component. By design, delivery of most messages can be delayed indefinitely. Our core services are designed to work without relying on NSQ, and this decoupling allowed many of our systems to continue operating through this outage.
On Monday 15th January, we were alerted to higher-than-normal latencies when publishing messages to NSQ. For our users, this resulted in transactions being delayed before appearing in the app and delayed push notifications. We started investigating and mitigated these high latencies by “pausing” less time-sensitive services. Within half an hour we were back to normal operation and no further customer impact was experienced on Monday.
After the incident, we formed a team dedicated to finding and fixing the underlying cause for this increased latency. The cause was found to be that NSQ was overloaded at peak times. We had not increased the capacity of this cluster recently, and with the increase in customers and transactions, it could not keep up.
With our NSQ topology (and more generally in our platform), we can take nodes offline without any downtime. By Monday evening, we had upgraded a single node that formed part of our NSQ cluster and we monitored its performance for signs of improvement.
The team judged the change to be effective and proceeded to upgrade the rest of the nodes. During the upgrade, we paused and unpaused individual, non-critical message channels in order to test the new nodes.
At 14:37 our COps (Customer Operations) team noticed an increase in support requests and escalated this to the engineering team. At this point we were experiencing a partial outage. While payments continued to process normally, customers stopped receiving transactions, push notifications, and some other updates in their apps.
At 14:54 we identified that we had accidentally paused processing of all messages queued in NSQ. Pausing some of these consumers has no impact on customers, as their messages can be delayed and processed later, but some of the consumers that had been accidentally paused were necessary to update customers’ apps. This caused a backlog of tens of millions of messages within minutes.
At 15:00, we unpaused all message consumption. As our services started to process the huge backlogs, some became saturated with messages. We had distributed millions of messages into our own services, effectively a self-DDoS, and this was when the problems became more serious: some customers experienced failures to load the app and a small number of payments began to fail.
Because the NSQ cluster was under such extreme load, it was unresponsive to our commands to pause message processing again and alleviate that load. By 15:30 we had succeeded in getting the load under control, and this restored most functionality for customers.
At this point, our team worked on getting our NSQ cluster into a state where we were ready to resume message delivery. However, because of the message-saturation issues described earlier, we could not unpause all channels at the same time; we had to unpause channels individually and monitor the performance of downstream services. To mitigate the effect of this on our customers, we created a new NSQ cluster and directed our services to use this new cluster for new messages.
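The idea of draining channels one at a time, rather than all at once, can be sketched as follows. This is a simplified illustration, not the procedure we actually ran: the function name, the channel names, the backlog sizes and the single `capacity_per_tick` limit are all hypothetical stand-ins for "how much a downstream service can safely absorb".

```python
def staged_unpause(backlogs, capacity_per_tick, order):
    """Drain channels one at a time without exceeding a delivery limit.

    backlogs: dict of channel name -> queued message count (made-up figures)
    capacity_per_tick: most messages downstream services can absorb per tick
    order: channel names, most customer-critical first
    Returns a list of (tick, channel, delivered) steps.
    """
    remaining = dict(backlogs)  # copy so the caller's dict is not modified
    steps = []
    tick = 0
    for name in order:
        # Only move on to the next channel once this one is fully drained.
        while remaining[name] > 0:
            delivered = min(capacity_per_tick, remaining[name])
            remaining[name] -= delivered
            steps.append((tick, name, delivered))
            tick += 1
    return steps
```

Unpausing everything at once would offer the entire backlog to downstream services in one go; in this sketch no tick ever delivers more than `capacity_per_tick` messages, at the cost of taking longer to clear the backlog.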
By 18:30 we were processing all new messages normally and the backlog for payments and transactions made after this time had been processed.
Mitigation and Prevention
The original capacity issue that prompted the upgrade of NSQ nodes was not noticed until it started affecting our services. We have now improved our visibility into NSQ performance and will be able to identify potential issues before they reach the point of affecting our service. This includes more metrics around message-publishing latencies, as well as system-level monitoring of our NSQ nodes so we can identify when we are close to reaching resource limits.
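As a rough illustration of alerting on publish latency, the sketch below computes a nearest-rank percentile over a window of latency samples and compares it to a threshold. The function names, the 99th-percentile choice and the threshold values are assumptions made for the example, not a description of our actual monitoring setup.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def should_alert(latencies_ms, threshold_ms, pct=99):
    """True when the chosen percentile of publish latency exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms
```

Alerting on a high percentile rather than the mean is a common design choice, because a handful of very slow publishes can be hidden by an otherwise healthy average.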
In retrospect, the decision to unpause all message consumption at once was a poor one, and we could have avoided it had we been fully aware of the consequences of doing this in the presence of large backlogs. We are improving our team’s knowledge and awareness of how we use NSQ in order to prevent partial outages from escalating into full outages.
Looking further ahead, we realise that all core services need to be able to scale automatically and without human intervention, in response to increased or decreased load, and we will implement systems to make this possible.
We take all outages really seriously and we’re sorry for the impact this incident caused to our customers. Our aim has always been to build a bank that you can depend on. We’ve written this post-mortem because we believe you deserve to know what happened, and what steps we’ve taken to address these issues. If you have any questions, please do ask me and our team.