Hi Majdi… thanks for asking about this. I'm posting some fairly in-depth technical detail below. Sorry if this just seems full of jargon, but I want to give the best explanation possible.
We take outages like this really seriously. We want to be the bank with the best technology in the world, and we're working as hard as we can to make that happen. This week has seen some particularly bad mess-ups, and we're really sorry for them. As you point out, what's important now is that we learn from our mistakes and prevent these problems from happening again.
On Tuesday, @matt, @simon and I moved all our traffic over to a new backend platform. We made these changes to make our platform even more resilient and scalable. I know this might not seem directly relevant to the outage, but I think the context is helpful. Here are some of the most important things we did under the hood:
Changing how our services talk to one another
Our backend is composed of about 130 microservices (and we're adding more all the time!). All of these need to communicate among themselves. We previously used RabbitMQ for this, but we've found it very difficult to operate in a cloud environment where virtual machines come and go continuously. Instead, we are now using HTTP internally, via a proxy called linkerd.
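To make that a little more concrete, here's a rough Go sketch of what calling another service through linkerd can look like. It's illustrative rather than lifted from our codebase: linkerd 1.x listens on a local port (4140 by default) and routes based on the request's Host header, and the service name below is made up.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Send plain HTTP to the local linkerd instance (4140 is linkerd's
	// default port); linkerd picks a healthy instance of the named service.
	req, err := http.NewRequest("GET", "http://localhost:4140/ping", nil)
	if err != nil {
		panic(err)
	}
	// linkerd routes on the Host header. "service.account" is a
	// hypothetical service name, purely for illustration.
	req.Host = "service.account"

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

The nice property is that the calling service doesn't need to know where instances of the other service actually live; the proxy takes care of discovery and load balancing.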
Changing how our services run
We started using Docker to "containerise" our services, and Kubernetes to distribute these containers around a cluster of CoreOS machines. This lets us make the most efficient use of our AWS virtual machines, makes our platform more resilient to network partitions and machine failures, and isolates our services completely from each other.
Implementing network policies
To improve security, we want to make absolutely sure that services within our infrastructure can only communicate with the services they're supposed to. We do this by putting firewalls between all our applications.
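The actual enforcement happens at the network layer, not in application code, but the principle is a default-deny allow-list. As a toy illustration (with made-up service names):

```go
package main

import "fmt"

// Each service declares which callers may reach it; anything not listed
// is denied by default. This is just the policy idea in miniature, not
// how the firewalls themselves are implemented.
var allowed = map[string]map[string]bool{
	"service.api":     {"service.account": true, "service.card": true},
	"service.account": {"service.ledger": true},
}

func mayCall(from, to string) bool {
	return allowed[from][to] // missing entries fall through to false
}

func main() {
	fmt.Println(mayCall("service.api", "service.account")) // true
	fmt.Println(mayCall("service.card", "service.ledger")) // false: denied by default
}
```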
To move our production traffic over, we joined our old and new Cassandra clusters together so they would share the same data. However, we didn't account for a feature of Cassandra called "topology events": through these, Cassandra tells connected clients about every server in the cluster. While our new Cassandra servers could talk to the old Cassandra servers, our services could not (there's a firewall which prevents this access). This meant they were trying to connect to servers they were isolated from, and these connection failures led to requests failing. When we identified the problem, the fix was simple: prevent our services from connecting to the "old" servers.
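As a rough sketch of what that kind of fix looks like in Go, assuming the gocql driver (an assumption on my part, not a statement about our exact client): you can hand the driver an explicit whitelist of hosts, so topology events can't lead it to dial servers it's firewalled off from. The addresses are invented.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical addresses of the "new" Cassandra servers.
	cluster := gocql.NewCluster("10.0.1.1", "10.0.1.2")

	// Even when topology events advertise other nodes (like the "old"
	// servers), the whitelist filter stops the driver connecting to them.
	cluster.HostFilter = gocql.WhiteListHostFilter("10.0.1.1", "10.0.1.2")

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```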
Because so much of this new platform is, well, new, it took us a while to work out what was going on. In the end, the fix was almost embarrassingly simple for such an extended period of downtime. The re-platform was an effort to increase the time between failures of our systems, but it didn't do anything to reduce the time to repair when things do go wrong. To help with this, we're going to be adding some tools that will help us better understand problems when they happen, and alert us to these problems sooner.
Specifically, we'll introduce a distributed tracing system (somewhat like Google's Dapper), and beef up our monitoring considerably. This won't happen overnight, but we're working on it.
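To sketch the tracing idea (a toy example, not our design): each request carries a trace ID which every service logs and forwards on its outgoing calls, so a single user action can be followed across all the services it touches. The header name here is invented.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// Illustrative header name; real systems each pick their own convention.
const traceHeader = "X-Trace-Id"

// traced wraps a handler so every request gets a trace ID: reused if the
// caller sent one, freshly minted if this is the first hop.
func traced(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			b := make([]byte, 8)
			rand.Read(b)
			id = hex.EncodeToString(b)
		}
		// Logging the ID on every hop is what lets you stitch one
		// request's journey together across many services.
		log.Printf("trace=%s %s %s", id, r.Method, r.URL.Path)
		w.Header().Set(traceHeader, id)
		next(w, r)
	}
}

func main() {
	http.HandleFunc("/ping", traced(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```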
In the early hours of Wednesday morning, our card processor became disconnected from MasterCard. This disconnection lasted for several hours, but crucially it coincided with the period when TfL takes its nightly payments. Because TfL was unable to charge anyone's card, they blocked all those cards from making journeys until the balance was paid. We'll be working with our card processor to make sure this doesn't happen again.