RESOLVED: Current account payments may fail - Major Outage (27/10/2017)

We're all liberal Guardian readers on here :wink:

(JOKE: I liked the Guardian for its investigative journalism, idk about other people)

2 Likes

I never said they were good articles :rofl:. The point was that it was widely reported through the press. Broadsheets? Not so much. But it was reported throughout the press, and these two examples were also not behind a paywall.

2 Likes

I used to like them years ago, but now they just whine about everything. I actually blocked their articles on Apple News as they were so irritating.

Wow

HSBC having a glitch

Now that's something

1 Like

The grauniad, unfortunately, has started posting these 'articles' that are basically shills for a book the author is peddling. Tedious.

Hi everyone :wave: I'm Monzo's Head of Engineering, and as I promised on Friday, I'd like to share some more information about what happened during this outage. Because the nature of the issue was technical, this post is also quite technical. :nerd_face:

It's important to note that we had two major incidents last week that many of you will have experienced (sorry again). The first incident lasted most of the week and affected only our prepaid product – i.e. Monzo Alpha and Beta cards. The second outage affected both the prepaid product and our new current account for a period of around 1½ hours on Friday afternoon. This post is about the latter.

You can learn more about our overall backend architecture in this blog post I published last year, but to follow this issue it's important to understand, at a high level, the role of a few components in our stack:

  • Kubernetes is a system which deploys and manages all of our infrastructure. Monzo's backend is written as several hundred microservices, packaged into Docker containers. Kubernetes manages these Docker containers and ensures they are running properly across our fleet of AWS nodes.

  • etcd is a distributed database used by Kubernetes to store information about which services are deployed, where they are running, and what state they're in. Kubernetes requires a stable connection to etcd in order to work properly, although if etcd does go down all of our services do continue running – they just can't be upgraded, or scaled up or down.

  • linkerd is a piece of software that we use to manage the communication between all of the services in our backend. In a system like ours, thousands of network calls are happening every second, and linkerd does the job of routing and load balancing all of these calls. To know where to route them, it relies on receiving updates from Kubernetes about where services are located (a minimal sketch of this name-to-address resolution follows this list).
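
To make linkerd's role a little more concrete, here's a purely illustrative sketch in Go of the kind of name-to-address resolution it performs (linkerd itself is a JVM application, and none of the names below come from its real code). The important property is that routing is only correct while service-discovery updates from Kubernetes keep arriving.

```go
// Purely illustrative: a tiny routing table of the kind a proxy like linkerd
// maintains from Kubernetes service-discovery updates. These types and names
// are made up for this sketch and do not come from linkerd.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
)

// Resolver maps a logical service name (e.g. "service.foo") to the set of
// addresses currently backing it.
type Resolver struct {
	mu        sync.RWMutex
	endpoints map[string][]string // name -> ["10.0.1.7:8080", ...]
}

// Update is called whenever service discovery (Kubernetes, in our case)
// reports a new set of endpoints for a service. If these updates stop
// arriving, the table goes stale and requests go to dead addresses.
func (r *Resolver) Update(name string, addrs []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.endpoints == nil {
		r.endpoints = make(map[string][]string)
	}
	r.endpoints[name] = addrs
}

// Route picks one of the known addresses for the logical name.
func (r *Resolver) Route(name string) (string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	addrs := r.endpoints[name]
	if len(addrs) == 0 {
		return "", errors.New("no endpoints known for " + name)
	}
	return addrs[rand.Intn(len(addrs))], nil // naive load balancing
}

func main() {
	r := &Resolver{}
	r.Update("service.foo", []string{"10.0.1.7:8080", "10.0.2.9:8080"})
	fmt.Println(r.Route("service.foo"))
}
```

If Update stops being called for a service whose pods have moved, Route keeps handing out addresses that no longer point at a running process, which is exactly the symptom in the timeline below.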

Timeline

  • Two weeks before: The Platform team makes some changes to our etcd cluster to upgrade it to a new version, and also to increase the size of the cluster. Previously, this cluster consisted of three nodes (one in each of our three zones); we raise this to nine (three in each zone). Because etcd relies on being able to achieve a quorum to make progress, this setup means we can tolerate the simultaneous loss of an entire zone and a single node in another zone (the quorum arithmetic is sketched just after this timeline).

    This upgrade went according to plan, and didn't involve any downtime. We're satisfied that this cluster is behaving correctly, but this context is important as it triggers a bug in another system later.

  • One day before: A team developing a new feature for the current account deploys a new service to production, but notices that it is experiencing issues. As a precautionary measure, they scale the service down so that it has no running replicas, but the Kubernetes service still exists.

  • 14:10: An engineer deploys a change to a service needed to process payments for the current account. Making changes is not unusual and is something our engineers do all the time: to minimise the risk of changes we make them small and frequent, using a repeatable, well-defined process to release them. When this service was deployed, however, all requests to it started to fail. This is when current account customers started experiencing payment failures. At this time, the prepaid card is not affected as it does not use the broken service.

  • 14:12: The change is rolled back. This is standard practice for deployments that don't go according to plan, and when interfaces are changed they are backwards- and forwards-compatible to ensure that rolling back is a safe operation. However, in this case even after rolling back, the errors persisted and payments remained unavailable.

  • 14:16: We declare an outage internally. Members of the team start to convene to establish the impact of the problem and start to debug it.

  • 14:18: Engineers identify that linkerd appears to be in an unhealthy state, and attempt to use an internal tool designed to identify individual nodes that are experiencing problems and restart them.

    As described earlier, linkerd is a system which we use to manage communication between our backend services. To know where to send a particular request, it takes a logical name like service.foo from the request and turns it into an IP address and port. In this case, linkerd had not received an update from Kubernetes about where on the network the new pods were running. As such, it was trying to route requests to IP addresses that no longer corresponded to a running process.

  • 14:26: We believe that the best path forward is to restart all linkerd instances in our backend, of which there are several hundred, under the assumption that they are all experiencing the same issue. In parallel, many engineers are working to minimise the impact on customers making card payments or receiving bank transfers by activating internal processes designed to provide backup when we are experiencing problems. This means that most customers are still able to use their card successfully despite the ongoing instability.

  • 14:37: Replacement linkerd instances cannot start because the kubelet that runs on each of our nodes is failing to retrieve the appropriate configuration from the Kubernetes apiservers. At this point, we suspect an additional issue with Kubernetes or etcd and restart the three apiserver processes. When this is complete, the replacement linkerd instances are able to start successfully.

  • 15:13: All linkerd pods are restarted, but services that process thousands of requests per second are now receiving no traffic. At this point, customers are completely unable to refresh their feed or balance in the Monzo app and our internal COps (“Customer Operations” :policeman:) tools stop working. The issue has now escalated to a full platform outage, and no services are able to serve requests. As you can probably imagine, practically all of our automated alerts started triggering. :pager::fire:

  • 15:27: We notice that linkerd is logging a NullPointerException when it attempts to parse the service discovery response from the Kubernetes apiserver. We discover that this is an incompatibility between the versions of Kubernetes and linkerd that we're running, and specifically a failure to parse empty services, i.e. services with no endpoints (a sketch of this failure mode follows the timeline).

    Because we have been testing an updated version of linkerd, which contains a fix for this incompatibility, in our staging environment for several weeks, engineers from the Platform team begin deploying the new version in an attempt to roll forward.

  • 15:31: After inspecting the code change, engineers realise that they can prevent the parsing error by deleting Kubernetes services which contain no endpoints (i.e. the service mentioned earlier that was scaled down to zero replicas as a precautionary measure). They delete the offending service and linkerd is successfully able to load service discovery information. At this point, the platform recovers, traffic starts transiting between services normally, and payments start to work again. The incident is over. :relieved:
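
Two details from the timeline are worth expanding on. First, the quorum arithmetic behind the etcd resize: etcd can only make progress while a majority of members (n/2 + 1, using integer division) is healthy. A tiny illustrative calculation for the old three-node and new nine-node clusters, purely as an example rather than anything from our tooling:

```go
// Purely illustrative: quorum and fault tolerance for a Raft-based cluster
// such as etcd. Not taken from any real tooling.
package main

import "fmt"

// quorum returns the majority size needed for the cluster to make progress.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 9} {
		q := quorum(n)
		fmt.Printf("%d members: quorum=%d, can lose %d\n", n, q, n-q)
	}
	// 3 members: quorum=2, can lose 1
	// 9 members: quorum=5, can lose 4
	// Losing a whole zone (3 nodes) plus one node elsewhere leaves 5 of 9,
	// so the nine-node cluster keeps its quorum.
}
```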
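
Second, the parsing failure itself. linkerd is a JVM application, so the NullPointerException came from Scala/Java code rather than anything like the snippet below, but the failure mode is easy to sketch: the version we were running effectively assumed every service in the discovery response had at least one endpoint, and the service scaled down to zero replicas the day before broke that assumption. A purely illustrative Go sketch of the tolerant behaviour (not linkerd's actual fix):

```go
// Hypothetical sketch of the failure mode: a discovery response may contain
// services with no endpoints at all (e.g. one scaled down to zero replicas).
// The buggy behaviour corresponds to assuming addresses are always present;
// the tolerant behaviour is to skip empty services rather than failing to
// parse the whole response.
package main

import "fmt"

type Endpoints struct {
	Service   string
	Addresses []string // nil/empty when the service has no running replicas
}

func buildRoutes(resp []Endpoints) map[string][]string {
	routes := make(map[string][]string)
	for _, ep := range resp {
		if len(ep.Addresses) == 0 {
			// An empty service is a legitimate state; tolerate it.
			continue
		}
		routes[ep.Service] = ep.Addresses
	}
	return routes
}

func main() {
	resp := []Endpoints{
		{Service: "service.payments", Addresses: []string{"10.0.3.4:8080"}},
		{Service: "service.new-feature"}, // scaled down to zero replicas
	}
	fmt.Println(buildRoutes(resp))
}
```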

Root cause

At this point, while we'd brought our systems back online, we did not yet understand the root cause of the problem. Our backend network is very dynamic because of deployment frequency and automated reaction to node and application failure, so being able to trust our deployment and request routing subsystems is extremely important.

We've since found a bug in Kubernetes and the etcd client that can cause requests to time out after cluster reconfiguration of the kind we performed the week prior. Because of these timeouts, when the service was deployed linkerd failed to receive updates from Kubernetes about where it could be found on the network. While well-intentioned, restarting all of the linkerd instances was an unfortunate and poor decision that worsened the impact of the outage, because it exposed a different incompatibility between versions of software we had deployed.
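
To illustrate the general class of problem (a client whose view of the etcd cluster goes stale after a membership change), here's a hedged sketch using the etcd v3 Go client, which has options for keeping its endpoint list in sync with cluster membership. This is purely illustrative: it isn't the code involved in the bug, and the endpoint addresses below are made up.

```go
// Illustration only: keeping an etcd v3 client's endpoint list in sync with
// cluster membership, so a reconfiguration (like growing from three to nine
// nodes) does not leave the client talking to a stale view of the cluster.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Hypothetical endpoint addresses.
		Endpoints:        []string{"https://etcd-0:2379", "https://etcd-1:2379", "https://etcd-2:2379"},
		DialTimeout:      5 * time.Second,
		AutoSyncInterval: 30 * time.Second, // refresh endpoints from cluster membership
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// An explicit sync can also be done after a known reconfiguration.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := cli.Sync(ctx); err != nil {
		log.Printf("endpoint sync failed: %v", err)
	}
}
```

The real fix belongs inside Kubernetes's own use of the etcd client, as noted in the list of follow-up steps below; the snippet just shows the kind of re-synchronisation a reconfiguration requires.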

Remarks

A large-scale failure in a distributed system can be very difficult to understand, and well-intentioned human action can sometimes compound issues, as happened here. When things like this do happen, we want to learn as much as possible from the event to ensure it can't resurface. We've identified several steps we'll take in the short term:

  1. Fix the bug in Kubernetes that can trigger timeouts following a cluster reconfiguration.

  2. Roll out a new version of linkerd that fixes the parsing error.

  3. Create better health checks, dashboards and alerts for the affected components, to surface clearer signals about what is wrong and prevent human error (a sketch of one such health check follows this list).

  4. Improve our procedures to ensure we communicate outages internally and externally as clearly and quickly as possible.
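
To make point 3 concrete, here's one entirely hypothetical example of the kind of signal that could have shortened this incident: a health check that reports unhealthy when service-discovery data has gone stale, rather than only confirming that the process is up. A minimal Go sketch; the lastUpdate variable stands in for whatever consumes the discovery stream, and the names and threshold are illustrative.

```go
// Entirely hypothetical: a health check that goes unhealthy when
// service-discovery updates stop arriving, instead of only reporting
// "process is running". Names and the staleness threshold are made up.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var lastUpdate atomic.Int64 // unix nanoseconds of the last discovery update

// recordUpdate would be called from whatever consumes the discovery stream.
func recordUpdate() { lastUpdate.Store(time.Now().UnixNano()) }

func healthz(w http.ResponseWriter, r *http.Request) {
	age := time.Since(time.Unix(0, lastUpdate.Load()))
	if age > 2*time.Minute {
		msg := fmt.Sprintf("stale service discovery: last update %s ago", age)
		http.Error(w, msg, http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	recordUpdate() // simulate an initial update
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```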

I want to reassure everyone that we take this incident very seriously; it's among the worst technical incidents in our history, and our aim is to run a bank that our customers can always depend on. We know we let you down, and we're really sorry for that. I hope that this post-mortem gives some clarity on what happened and what we're doing to make sure it doesn't recur. I'll make sure we post something similar for any other incident of this severity: if I were a customer I'd want to know, and I personally find this kind of post fascinating as an insight into production systems. Do let me know if you have any questions. :pray:

113 Likes

What an absolutely fascinating read. Thanks for such detailed insight into the inner workings (or not :stuck_out_tongue_closed_eyes:) of Monzo. :+1:

10 Likes

The three services you mention - are these products that you subscribe to or purchase outright? Are they physically on local servers or purely in the cloud? Do they force upgrades on you?

Some organisations try to gloss over things that went wrong. This sort of explanation only gives me greater confidence in Monzo.

17 Likes

I was gonna mention that but I am sure someone is watching :eyes:

Kubernetes is based on the platform that runs Google

1 Like

All three of these components – like the vast majority of our backend – are free, open-source software. Often these projects were started by, or are built upon, technology used by large internet companies. Upgrades are totally within our control, but we generally try to run within a few versions of the latest. :up:

7 Likes

I'm with @anon70107404 on this one. Thank you very much for the detailed explanation.

I have a question if I may?

Following the impact that this change had on the service, are there any plans to run potentially service-affecting changes “out of hours”? Not that there is really any such thing in the banking world, but between 10pm and 6am I would imagine the traffic would be lower?

Thanks!

2 Likes

As someone that works on backend systems using Docker etc, that was a fascinating read! I love how open, detailed and technical it was :ok_hand:

5 Likes

Thaaank you so much for sharing. I'm gonna read it a few times later on, just so I'm sure I understood it all. :smiley:

3 Likes

This is a superb insight and I really appreciate you writing it :+1:

As someone that has used Docker/Kubernetes (on a much smaller scale), I can see how, when scaled up, it can be very difficult to identify exactly where you have an issue - especially when the breaking change doesn't manifest itself immediately.

3 Likes

Did nobody tell you guys about the Friday deployments rule?

8 Likes

:star_struck: Excellent post-mortem that increases my confidence in Monzo as a bank, and which was also fun to read as a developer.

Please keep being open like this about your tech stack and future outages as you grow - there are bound to be some :smiling_imp:

5 Likes

Thanks for the transparency over this. When companies hide facts it only causes users to guess and lose confidence in a service. This post shows Monzo is willing to engage with the community, and it increases confidence and trust in the bank. Please keep up the great work.

3 Likes

Thanks @oliver for this detailed insight, very informative. I recently attended a Google Cloud Platform workshop, so I could make sense of and understand the whole thing.
Not sure Santander would explain all this if things went down for a bit, so thank you Monzo for this. As @anon70107404 said;

2 Likes