Mondo suffered 2 outages in the last 2 or 3 days.
Can you tell us what happened and what is being done to avoid it in the future?
They’ve been moving the systems around
Hi Majdi… thanks for asking about this. I’m posting some fairly in-depth technical detail below. Sorry if this just seems full of jargon, but I want to give the best explanation possible.
We take outages like this really seriously. We want to be the bank with the best technology in the world, and we’re working as hard as we can to make that happen. This week has seen some particularly bad mess-ups, and we’re really sorry for them. As you point out, what’s important now is that we learn from our mistakes and prevent these problems from happening again.
### Tuesday’s outage
On Tuesday, @matt, @simon and I moved all our traffic over to a new backend platform. We made these changes to make our platform even more resilient and scalable. I know this might not seem directly relevant to the outage, but I think the context is helpful. Here are some of the most important things we did under the hood:
- **Changing how our services talk to one another.** Our backend is composed of about 130 microservices (and we’re adding more all the time!). All of these need to communicate among themselves. We previously used RabbitMQ for this, but we’ve found it very difficult to operate in a cloud environment where virtual machines come and go continuously. Instead, we are now using HTTP internally, via a proxy called linkerd (there’s a rough sketch of what a call like this looks like just after this list).
- **Changing how our services run.** We started using Docker to “containerise” our services, and Kubernetes to distribute these containers around a cluster of CoreOS machines. This means we can make the most efficient use of our AWS virtual machines, makes our platform more resilient to network partitions and machine failures, and isolates our services completely from each other.
- **Implementing network policies.** To improve security, we want to make absolutely sure that services within our infrastructure can only communicate with the things they should communicate with. We do this by putting firewalls between all our applications.
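To make the linkerd change a little more concrete, here’s a rough sketch of what an internal call can look like when it goes through a local HTTP proxy. The service name, endpoint and proxy port below are made up for illustration – they’re not our real configuration:

```go
package main

// Illustrative only: a hypothetical caller sending an internal request via a
// local linkerd HTTP proxy. The service name ("service.account"), the endpoint
// and the proxy address are assumptions for this sketch.

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Route all outbound requests through the linkerd proxy running on this host.
	proxyURL, _ := url.Parse("http://127.0.0.1:4140")
	client := &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)}}

	// The proxy resolves the logical name in the request to a healthy instance,
	// handling service discovery and load balancing for us.
	resp, err := client.Get("http://service.account/balance?account_id=acc_123")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```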
To move our production traffic over, we joined our old and new Cassandra clusters together so they would share the same data. However, we didn’t account for a feature of Cassandra called “topology events”: essentially, Cassandra informs our services about every Cassandra server in the cluster. While our new Cassandra servers could talk to the old Cassandra servers, our services could not (there’s a firewall which prevents this access). This meant they were trying to connect to servers they were isolated from, and these connection failures led to requests failing. Once we identified the problem, the fix was simple: prevent our services from connecting to the “old” servers.
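To give a feel for the shape of the fix (this is an illustration, not our actual code), with a Go Cassandra driver like gocql you can explicitly whitelist the hosts a service is allowed to connect to, so announcements about other nodes get ignored:

```go
package main

// Rough illustration of the shape of the fix, not our actual code: restrict
// the Cassandra driver to the "new" hosts so topology events about the old,
// firewalled-off servers are ignored. Addresses and keyspace are made up.

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.1.0.10", "10.1.0.11", "10.1.0.12") // new cluster only
	cluster.Keyspace = "example"

	// Only ever connect to the hosts we list explicitly, even if the cluster
	// announces other members (e.g. the old servers) via topology events.
	cluster.HostFilter = gocql.WhiteListHostFilter("10.1.0.10", "10.1.0.11", "10.1.0.12")

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```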
Because so much of this new platform is, well, new, it took us a while to work out what was going on. In the end, the fix was almost embarrassingly simple for such an extended period of downtime. The re-platform was an effort to increase the time between failures of our systems, but it didn’t do anything to reduce the time to repair when things do go wrong. To help with this, we’re going to be adding some tools that will help us better understand problems when they happen, and alert us to them sooner.
Specifically, we’ll introduce a distributed tracing system (somewhat like Google’s Dapper), and beef up our monitoring considerably. This won’t happen overnight, but we’re working on it.
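As a flavour of what Dapper-style tracing involves, the most basic building block is just passing a trace ID along with every hop so that logs from different services can be stitched together. This is a toy sketch with a made-up header name and handler, not our production code:

```go
package main

// Toy sketch of one building block of distributed tracing: attach a trace ID
// to each incoming request and pass it along on every outbound call, so logs
// from different services can be correlated. Header name and handler are
// hypothetical.

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

const traceHeader = "X-Trace-Id" // assumed header name for illustration

func newTraceID() string {
	b := make([]byte, 8)
	rand.Read(b) // error ignored for brevity in this sketch
	return hex.EncodeToString(b)
}

func withTrace(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get(traceHeader)
		if traceID == "" {
			traceID = newTraceID() // start a new trace at the edge
		}
		// Log with the trace ID so this hop can be matched up with others.
		fmt.Printf("trace=%s %s %s\n", traceID, r.Method, r.URL.Path)

		// Any downstream call made while handling this request should copy the
		// same header, e.g. req.Header.Set(traceHeader, traceID).
		next(w, r)
	}
}

func main() {
	http.HandleFunc("/ping", withTrace(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "pong")
	}))
	http.ListenAndServe(":8080", nil)
}
```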
### Wednesday’s outage
In the early hours of Wednesday morning, our card processor became disconnected from MasterCard. This disconnection lasted for several hours, but crucially it coincided with the period when TfL does their nightly payments. Because TfL was unable to charge anyone’s card, they blocked all those cards from making journeys until the balance was paid. We’ll be working with our card processor to make sure this doesn’t happen again.
Thanks for the detail. Some of it is over my head but I get the general gist
Presumably come the days of full banking you’ll be your own card processor meaning that aspect of things will be under better local control?
Do you believe that scaling using microservices is viable in a business that could grow to millions of users, in an environment where any loss of transactions is critical?
Many points of failure?
Being an ACID boy, I’d love to hear more in-depth technical detail about resilience etc. Put the wheels in motion and reply.
Any plans for a status page? Nothing too flash. Say something like this here
Right here!
Individual incidents are linked from top of the app when they happen.
Ha. Brilliant. Loving your work.
Genuinely fascinating. Love that you are so transparent about processes and technology. Don’t understand a word of it mind.
PS: Don’t mess up again @simon.
I do not know much about Cassandra, but it seems that instead of providing a single interface and routing traffic to its nodes (like a load balancer), it actually returns the address of a node to communicate with (more like DNS).
And regarding the MasterCard endpoint, I thought you had more than a single one in different datacenters. Was it an authorisation issue or a clearing one?
Finally, what do you use for deploying your different services?
Thanks. I love the level of transparency and technical depth.
You lost me at Hello
Good question @trollied. You’re right that micro-services architectures can be considerably more difficult to operate than a monolithic counterpart – something we butted our heads against this week.
Stepping back a bit, the fundamental reason we’ve chosen a micro-services architecture is that we want to be able to scale – to (hundreds of) millions of customers, to countries across the world, and to large numbers of developers. It’s our experience that at this kind of scale, a monolithic application becomes so difficult to change that development slows to a crawl. Teams instead need to be able to operate independently of one another: controlling their own deployments, responding to their own issues, and able to develop their own features. Breaking the application into co-operating but independent pieces is how we achieve this, and it’s the approach we’ve learned from the likes of Google, Facebook, Twitter, Netflix, etc.
Of course, this doesn’t come for free. When a single request may touch tens or even hundreds of backend services and hosts, it can be difficult to understand what’s happening when things go wrong. We need solid abstractions for inter-service communication, and we need tools that operate on this communication and give us visibility into it. I think we’re doing pretty well at the first half, and we need to work on the second.
Consistency is an interesting problem, but I don’t think it’s unique to micro-services. Even in a monolithic application, things do break – the correct behaviour of the system in the presence of failure is critically important. With so many network hops, these problems happen more frequently in distributed applications, but I think this can be a strength, not a weakness. By recognising, building in tolerance for, and even injecting failure into our applications, we can ensure things behave correctly under stress. Isolating the different parts of the system also makes it possible for us to work on product features without worrying that they could impact the most low-level, core systems, which have a minimal surface area.
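To give a flavour of what “building in tolerance for failure” means in practice, here’s a toy example of a call with a per-attempt deadline and a bounded number of retries. It’s purely illustrative – the URL and numbers are made up:

```go
package main

// Purely illustrative, not our actual client code: every inter-service call
// gets a deadline, and transient failures are retried a bounded number of
// times instead of hanging or failing immediately.

import (
	"fmt"
	"net/http"
	"time"
)

func getWithRetry(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 500 * time.Millisecond} // per-attempt deadline
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a client error we shouldn't retry
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("server error: %s", resp.Status)
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // simple backoff
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	// "service.example" is a placeholder hostname, not a real service.
	if _, err := getWithRetry("http://service.example/health", 3); err != nil {
		fmt.Println(err)
	}
}
```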
It’s worth pointing out that it’s very unlikely we will build our core ledger on top of Cassandra – its consistency model isn’t strong enough. Instead, we’re looking at storage that provides ACID-like consistency guarantees even at global scale, like those modelled after Google’s Spanner database (eg. CockroachDB).
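To illustrate the kind of guarantee we’re after, here’s a sketch of a double-entry transfer against a Postgres-compatible store like CockroachDB. The schema, table and connection details are invented – we haven’t built this – but the point is that both legs of the transfer commit atomically or not at all:

```go
package main

// Sketch only: the schema, table and connection string are hypothetical.
// A double-entry ledger transfer must be atomic: both legs commit or neither
// does, which is exactly what an ACID transaction gives us. CockroachDB speaks
// the Postgres wire protocol, so a standard Postgres driver works.

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-compatible driver
)

func transfer(db *sql.DB, from, to string, pence int64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits successfully

	// Debit one account and credit the other inside the same transaction.
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance - $1 WHERE id = $2", pence, from); err != nil {
		return err
	}
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance + $1 WHERE id = $2", pence, to); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://user@localhost:26257/ledger?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if err := transfer(db, "acc_alice", "acc_bob", 500); err != nil {
		log.Fatal(err)
	}
}
```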
Hopefully this explains some of our thinking about how we’ll solve these sorts of problems. Thanks for asking about this – I’m really enjoying this discussion and I’m thrilled people are interested in our backend tech.
@mgaidia Cassandra is interesting in that it does a bit of both. A query can be issued to any node in the cluster – the receiving node is called the “co-ordinator node” – and it will proxy the query to the set of replicas and gather the responses to ensure the right level of consistency is achieved. In this sense, it is like a load balancer†.
But in addition to this, clients still need to have an up-to-date record of all Cassandra servers. Consider if we replaced all the nodes in the cluster (a valid – if extremely rare – operation). Without some mechanism to notify the client of the new nodes, it would be left with an outdated list of Cassandra servers, and it wouldn’t be able to contact any of them. This is the “topology events” feature I mentioned in my previous post – indeed it’s a bit like a “push” version of DNS.
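If it helps to see it from the client’s side, here’s a hypothetical example using the gocql driver: you seed it with a couple of contact points, it learns about the rest of the ring from the cluster, and whichever node receives a query co-ordinates it across the replicas. The keyspace, table and addresses are made up:

```go
package main

// Hypothetical example (keyspace, table and addresses are made up) of how a
// Cassandra client sees the cluster: seed it with a couple of contact points,
// let it discover the rest of the ring, and ask for QUORUM so the
// co-ordinating node waits for a majority of replicas.

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Contact points are just a starting set; the driver learns about the other
	// nodes (and about nodes joining or leaving) from the cluster itself.
	cluster := gocql.NewCluster("10.1.0.10", "10.1.0.11")
	cluster.Keyspace = "example"
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	var balance int64
	// Whichever node receives this acts as co-ordinator: it forwards the read
	// to the replicas and only returns once a quorum of them have answered.
	if err := session.Query(
		"SELECT balance FROM accounts WHERE id = ?", "acc_123").Scan(&balance); err != nil {
		log.Fatal(err)
	}
	fmt.Println("balance:", balance)
}
```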
The card processing failure on Wednesday happened at our card processor. We remained connected to them throughout, but they became disconnected from MasterCard. We know that this isn’t good enough and we’re working with our suppliers to ensure we have as reliable a connection as possible going forward.
†This document explains the query co-ordination process quite well, if you’re interested to know more.
Which parts of the whole product cycle do you outsource, and to whom? Do you have plans to bring more stuff in house as time goes on?
I don’t have a full list at the moment I’m afraid, but we have a bunch. They help us do things like card processing, identity checks, SMS and email delivery, in-app support chat, analytics, hosting, etc. Without these guys, Mondo wouldn’t exist today!
@oliver With so many microservices I’m just wondering if you’re taking a more nanoservices approach (i.e. a single function - ala AWS Lambda / Google Cloud Functions) or if each of these services you’ve mentioned above are fully fledged services with 1+ endpoints? Or a mixture of both approaches?
I too have found RabbitMQ to be a PITA to manage in an auto-scaling “cattle not pets” environment. Have you evaluated NATS or nsq at all?
Ok, what is the difference between a micro service and a nano service?
We’re moving towards SOA at work, but I’ve never heard of a nano service…
Microservices IMO is just another name for SOA.
Nanoservices are individual functions of work exposed as a service. The term has come about since AWS Lambda was launched (and others such as Google Cloud Functions).
In the context of a bank you might expose some services with multiple endpoints, such as a customer service (read/write customer data), a transaction ledger service, etc. However, Oliver above said they’ve got 130+ of these services, so I’m wondering if a large number of them are single-purpose services (akin to a function; 10-100 lines of code) that just do one thing - e.g. a function that returns an emoji for a specific type of shop, or a function that sends a customer their PIN by SMS.
By breaking apart a traditional service that offers multiple endpoints into multiple nanoservices, you can scale at the macro level as well as offer agility: developers in your organisation can work in different programming languages, operating systems, or architectures.
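As a made-up example of what I mean, a nanoservice might be nothing more than a single HTTP handler:

```go
package main

// Made-up example of a "nanoservice": one tiny, single-purpose function
// exposed over HTTP, e.g. mapping a merchant category to an emoji.

import (
	"fmt"
	"net/http"
)

var categoryEmoji = map[string]string{
	"groceries":  "🍏",
	"transport":  "🚇",
	"eating_out": "🍔",
}

func main() {
	http.HandleFunc("/emoji", func(w http.ResponseWriter, r *http.Request) {
		emoji, ok := categoryEmoji[r.URL.Query().Get("category")]
		if !ok {
			emoji = "💳" // fallback for unknown categories
		}
		fmt.Fprint(w, emoji)
	})
	http.ListenAndServe(":8080", nil)
}
```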
A lot of this is made simpler thanks to orchestration technologies such as Kubernetes, because rather than provisioning VMs or hardware for individual services you can push the service onto your cluster of servers and the orchestrator will appropriately place the service onto hardware that has enough free resources.
On the other hand, managing tens or hundreds of services instead of several larger services makes logging and internal communication a harder problem - an incoming external request might fail somewhere in the pipe as it touches 50-odd services, so you’ve got to be able to trace the request and capture enough context in your logs that it’s possible to actually debug problems.
I’m sure I heard Matt or Jonas describe the Mondo architecture as “essentially a distributed tracer and logging platform that happens to also do banking”.
Most services have several endpoints – but it varies a lot. The average looks to be around 3, though the largest service has around 25 endpoints. I’m not 100% sure whether you’d classify them as micro- or nano- services to be honest.
We’re actually already using NSQ as an asynchronous message queue – in contrast to Rabbit, which we used as a synchronous bus (requests and replies). We really like how simple it is to operate and how performant it is, but because durability is so important to us, we’re going to be switching to Kafka soon.
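To show what we mean by simple, publishing an event to NSQ is only a few lines with the official Go client. The topic, payload and address here are made up for illustration:

```go
package main

// Illustrative only (topic name, payload and address are made up): publishing
// an asynchronous event to NSQ with the official Go client.

import (
	"encoding/json"
	"log"

	nsq "github.com/nsqio/go-nsq"
)

func main() {
	producer, err := nsq.NewProducer("127.0.0.1:4150", nsq.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Stop()

	event, _ := json.Marshal(map[string]string{
		"type":       "transaction.created",
		"account_id": "acc_123",
	})

	// Fire-and-forget: consumers pick this up asynchronously from the queue.
	if err := producer.Publish("transaction-events", event); err != nil {
		log.Fatal(err)
	}
}
```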
This is a great read @oliver. Thanks for sharing all of it.
I notice that even for cases where you need ACID-like consistency, the tools you mention are quite new and you don’t even talk about using traditional stuff like the big relational DBs. Can you explain why? Do you think they are inferior?
SQL might not be cool, but it’s well understood and robust and it has its use cases. I’m interested in the reasons you’d pick a tool that’s still in research, as opposed to something that’s been proven in production for years.