RESOLVED: Current account payments may fail - Major Outage (27/10/2017)

Hi everyone :wave: I’m Monzo’s Head of Engineering, and as I promised on Friday I’d like to share some more information about what happened during this outage. Because the nature of the issue was technical, this post is also quite technical. :nerd_face:

It’s important to note that we had two major incidents last week that many of you will have experienced (sorry again). The first incident lasted most of the week and affected only our prepaid product, i.e. Monzo Alpha and Beta cards. The second outage affected both the prepaid product and our new current account for a period of around 1½ hours on Friday afternoon. This post is about the latter.

You can learn more about our overall backend architecture in this blog post I published last year, but it’s important to understand the role of a few components in our stack at a high level to understand this issue:

  • Kubernetes is a system which deploys and manages all of our infrastructure. Monzo’s backend is written as several hundred microservices, packaged into Docker containers. Kubernetes manages these Docker containers and ensures they are running properly across our fleet of AWS nodes.

  • etcd is a distributed database used by Kubernetes to store information about which services are deployed, where they are running, and what state they’re in. Kubernetes requires a stable connection to etcd in order to work properly, although if etcd does go down, all of our services continue running: they just can’t be upgraded, or scaled up or down.

  • linkerd is a piece of software that we use to manage the communication between all of the services in our backend. In a system like ours, thousands of network calls are happening every second, and linkerd does the job of routing and load balancing all of these calls. In order to know where to route these calls, it relies on being able to receive updates about where services are located from Kubernetes.
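
To make that last point a little more concrete, here’s a minimal, illustrative sketch in Go (using the official Kubernetes client-go library) of how a routing layer can turn a logical service name into concrete IP addresses and ports by reading the Kubernetes Endpoints API. This is not linkerd’s actual implementation, and the namespace and service name are made up; it only shows the kind of data linkerd depends on Kubernetes for. If that data goes stale, requests end up routed to addresses where nothing is listening any more, which is exactly the failure mode described below.

```go
// Illustration only: turning a logical service name into concrete
// IP:port pairs by reading the Kubernetes Endpoints API. This is not
// linkerd's real implementation; the namespace and name are made up.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assume we are running inside the cluster, as linkerd does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// A logical name like service.foo ultimately maps to a Kubernetes
	// Endpoints object listing the pod IPs and ports backing the service.
	eps, err := client.CoreV1().Endpoints("default").Get(context.Background(), "foo", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, subset := range eps.Subsets {
		for _, addr := range subset.Addresses {
			for _, port := range subset.Ports {
				fmt.Printf("route foo -> %s:%d\n", addr.IP, port.Port)
			}
		}
	}
}
```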

Timeline

  • Two weeks before: The Platform team makes some changes to our etcd cluster to upgrade it to a new version, and also to increase the size of the cluster. Previously, this cluster consisted of three nodes (one in each of our three zones); we raise this to nine (three in each zone). Because etcd relies on a quorum (a strict majority of its members) to make progress, a nine-node cluster can tolerate the simultaneous loss of four nodes: an entire zone plus a single node in another zone. (There’s a short worked example of this arithmetic just after the timeline.)

    This upgrade went according to plan, and didn’t involve any downtime. We’re satisfied that this cluster is behaving correctly, but this context is important as it triggers a bug in another system later.

  • One day before: A team developing a new feature for the current account deploys a new service to production, but notices that it is experiencing issues. As a precautionary measure, they scale the service down so that it has no running replicas, but the Kubernetes service still exists.

  • 14:10: An engineer deploys a change to a service needed to process payments for the current account. Making changes is not unusual and is something our engineers do all the time: to minimise the risk of changes, we make them small and frequent, using a repeatable, well-defined process to release them. When this service was deployed, however, all requests to it started to fail. This is when current account customers started experiencing payment failures. At this time, the prepaid card is not affected as it does not use the broken service.

  • 14:12: The change is rolled back. This is standard practice for deployments that don’t go according to plan, and when interfaces are changed they are backwards- and forwards-compatible to ensure that rolling back is a safe operation. However, in this case, even after rolling back, the errors persisted and payments remained unavailable.

  • 14:16: We declare an outage internally. Members of the team start to convene to establish the impact of the problem and start to debug it.

  • 14:18: Engineers identify that linkerd appears to be in an unhealthy state, and attempt to use an internal tool designed to identify individual nodes that are experiencing problems and restart them.

    As described earlier, linkerd is the system we use to manage communication between our backend services. To know where to send a particular request, it takes a logical name like service.foo from the request and turns it into an IP address and port. In this case, linkerd had not received an update from Kubernetes about where on the network the new pods were running, so it was trying to route requests to IP addresses that no longer corresponded to a running process.

  • 14:26: We believe that the best path forward is to restart all of the several hundred linkerd instances in our backend, under the assumption that they are all experiencing the same issue. In parallel, many engineers are working to minimise the impact on customers making card payments or receiving bank transfers by activating internal processes designed to provide backup when we are experiencing problems. This means that most customers are still able to use their cards successfully despite the ongoing instability.

  • 14:37: Replacement linkerd instances cannot start because the kubelet that runs on each of our nodes is failing to retrieve the appropriate configuration from the Kubernetes apiservers. At this point, we suspect an additional issue with Kubernetes or etcd and restart the three apiserver processes. When this is complete, the replacement linkerd instances are able to start successfully.

  • 15:13: All linkerd pods are restarted, but services that process thousands of requests per second are now receiving no traffic. At this point, customers are completely unable to refresh their feed or balance in the Monzo app and our internal COps (“Customer Operations” :policeman:) tools stop working. The issue has now escalated to a full platform outage, and no services are able to serve requests. As you can probably imagine, practically all of our automated alerts started triggering. :pager::fire:

  • 15:27: We notice that linkerd is logging NullPointerException when attempting to parse the service discovery response from the Kubernetes apiserver. We discover that this is an incompatibility between the versions of Kubernetes and linkerd that we’re running: specifically, a failure to parse “empty” services, i.e. services with no endpoints.

    Because we have been testing an updated version of linkerd, which contains a fix for this incompatibility, in our staging environment for several weeks, engineers from the Platform team begin deploying the new version in an attempt to roll forward.

  • 15:31: After inspecting the code change, engineers realise that they can prevent the parsing error by deleting Kubernetes services which contain no endpoints (i.e. the service mentioned earlier that was scaled down to zero replicas as a precautionary measure). They delete the offending service and linkerd is successfully able to load service discovery information; there’s a short sketch after this timeline of how services like this can be found. At this point, the platform recovers, traffic starts transiting between services normally, and payments start to work again. The incident is over. :relieved:
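
For illustration, here’s a rough sketch of how services with no endpoints, like the one deleted at 15:31, can be found via the Kubernetes API. This is not the tooling we actually used during the incident; it’s a standalone Go example using the client-go library, and it assumes a local kubeconfig.

```go
// Illustration only: list Services whose Endpoints contain no addresses
// ("empty" services), the shape of object that triggered the linkerd
// parsing failure. This is not the tooling used during the incident.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig (e.g. ~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// An Endpoints object shares its name with the Service it backs, so an
	// Endpoints object with no addresses means a Service with no running pods.
	eps, err := client.CoreV1().Endpoints(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ep := range eps.Items {
		addresses := 0
		for _, subset := range ep.Subsets {
			addresses += len(subset.Addresses)
		}
		if addresses == 0 {
			fmt.Printf("empty service: %s/%s\n", ep.Namespace, ep.Name)
		}
	}
}
```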
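
The etcd resize at the start of this timeline comes down to quorum arithmetic, so here is the worked example promised earlier. It’s a tiny, purely illustrative snippet showing why a nine-member cluster tolerates the loss of a whole zone plus one more node, while a three-member cluster tolerates only a single failure.

```go
// Illustration only: etcd makes progress as long as a strict majority
// (a quorum) of its members are healthy.
package main

import "fmt"

func main() {
	for _, members := range []int{3, 9} {
		quorum := members/2 + 1       // smallest strict majority
		tolerated := members - quorum // members that can fail simultaneously
		fmt.Printf("%d members: quorum of %d, tolerates %d failures\n",
			members, quorum, tolerated)
	}
	// Prints:
	//   3 members: quorum of 2, tolerates 1 failures
	//   9 members: quorum of 5, tolerates 4 failures
	// With nine members spread three per zone, losing a whole zone (3 nodes)
	// plus one more node (4 in total) still leaves a quorum of 5.
}
```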

Root cause

At this point, while we’d brought our systems back online, we did not yet understand the root cause of the problem. Our backend network is very dynamic: we deploy frequently, and the platform reacts automatically to node and application failures, so being able to trust our deployment and request-routing subsystems is extremely important.

We’ve since found a bug in Kubernetes and the etcd client that can cause requests to time out after a cluster reconfiguration of the kind we performed two weeks earlier. Because of these timeouts, when the service was deployed, linkerd failed to receive updates from Kubernetes about where its new pods could be found on the network. While well-intentioned, the decision to restart all of the linkerd instances worsened the impact of the outage, because it exposed a different incompatibility between the versions of software we had deployed.
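
To make that failure mode a little more concrete: Kubernetes clients typically learn about endpoint changes by holding long-lived “watch” requests against the apiserver. The Go sketch below, using the client-go library, is illustrative only; it is neither linkerd’s nor our code, and the namespace is made up. It simply shows the idea of such a watch: if the underlying requests time out and nothing re-establishes the stream, the consumer keeps routing to whatever addresses it last saw.

```go
// Illustration only: a long-lived watch on Endpoints. A consumer like
// linkerd relies on a continuous stream of these updates; if the watch
// stalls or times out and is never re-established, its routing data
// silently goes stale.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	w, err := client.CoreV1().Endpoints("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	// Each event carries the full, updated Endpoints object for a Service.
	for event := range w.ResultChan() {
		if ep, ok := event.Object.(*corev1.Endpoints); ok {
			fmt.Printf("%s: endpoints for %s/%s changed\n", event.Type, ep.Namespace, ep.Name)
		}
	}
	// If this loop stops receiving events (for example because the underlying
	// request timed out) and nothing restarts the watch, newly scheduled pod
	// IPs are never learned.
}
```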

Remarks

A large scale failure in a distributed system can be very difficult to understand, and well-intentioned human action can sometimes compound issues, as happened here. When things like this do happen, we want to learn as much as possible from the event to ensure it can’t resurface. We’ve identified several steps we’ll take in the short-term:

  1. Fix the bug in Kubernetes that can trigger timeouts following a cluster reconfiguration.

  2. Roll out a new version of linkerd that fixes the parsing error.

  3. Create better health checks, dashboards and alerts for the affected components, to surface clearer signals about what is wrong and help prevent human error.

  4. Improve our procedures to ensure we communicate outages internally and externally as clearly and quickly as possible.

I want to reassure everyone that we take this incident very seriously; it’s among the worst technical incidents in our history, and our aim is to run a bank that our customers can always depend on. We know we let you down, and we’re really sorry for that. I hope this post-mortem gives some clarity on what happened and what we’re doing to make sure it doesn’t recur. I’ll make sure we post something similar for any other incident of this severity: if I were a customer, I’d want to know, and I personally find this kind of post fascinating as an insight into production systems. Do let me know if you have any questions. :pray:
