Absolutely superb blog post - honest and detailed but very clearly described too. The comparison with the likes of the TSB IT issues last year is stark!
The first question that came to my mind, as a DBA myself, was: if you are using a cluster, why not have a failover cluster for the database to avoid this kind of issue to begin with? I applaud your root-cause investigation and write-up; I seldom see that kind of work these days. But sometimes we have to look at our original designs to see if and where we could have done better. Great job on transparency too.
Good stuff. Appreciate the detail. You mentioned the Cassandra quorum. Is it possible to monitor quorum disagreements? That telemetry certainly would have told you there was a problem with Cassandra when there was an uptick in disagreements between the read replicas. It would also have shown up in testing when you added a single new server.
Also, it might be interesting to have automations that do general reads from Cassandra for random keys. You wouldn't want to put serious load on the backend, but this might help discover Cassandra problems faster than waiting for upstream services to fail, which probably lags. Not sure how you would determine a random key or how to gauge the frequency of the testing.
Hey Gray
Welcome to the community
That sounds like a very good suggestion
Cannot say whether that's a stat they can actually expose, but they did mention finding more aspects to monitor for signs of issues
This is a great suggestion, and something we've already picked up.
Since the incident we've updated our internal Cassandra health check service to write several hundred keys with known values, and attempt to read them back. We then expose the results as metrics for our monitoring system to pick up.
We tested scaling up a test cluster with the same method we used on July 29th, and this check immediately caught the data inconsistencies. We now run this service against all of our clusters and it's hooked up to a paging alert for our on-callers.
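To give a rough idea of the shape of it, here is a minimal sketch (illustrative only, not our actual service; the keyspace, table and metric names are placeholders): write a batch of keys with known values at quorum, read them straight back at quorum, and export the mismatch count for the monitoring system to alert on.

```go
// Illustrative health-check sketch: write keys with known values at QUORUM,
// read them back at QUORUM, and expose the number of mismatches as a metric.
// Keyspace, table and metric names are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gocql/gocql"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var mismatchedKeys = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "cassandra_healthcheck_mismatched_keys",
	Help: "Keys whose read-back value did not match what was just written.",
})

func runCheck(session *gocql.Session, numKeys int) {
	bad := 0
	for i := 0; i < numKeys; i++ {
		key := fmt.Sprintf("healthcheck-%d", i)
		want := fmt.Sprintf("value-%d-%d", i, time.Now().UnixNano())

		// Write the known value at quorum.
		if err := session.Query(
			`INSERT INTO healthcheck.probe (key, value) VALUES (?, ?)`,
			key, want,
		).Consistency(gocql.Quorum).Exec(); err != nil {
			bad++
			continue
		}

		// Read it straight back at quorum and compare.
		var got string
		err := session.Query(
			`SELECT value FROM healthcheck.probe WHERE key = ?`, key,
		).Consistency(gocql.Quorum).Scan(&got)
		if err != nil || got != want {
			bad++ // missing row or stale value: count it, never ignore it
		}
	}
	mismatchedKeys.Set(float64(bad))
}

func main() {
	cluster := gocql.NewCluster("cassandra-1", "cassandra-2", "cassandra-3")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	prometheus.MustRegister(mismatchedKeys)
	go func() {
		for {
			runCheck(session, 300) // "several hundred keys"
			time.Sleep(30 * time.Second)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```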
This is amazing…
It's so refreshing to understand and see what Monzo did behind the scenes
Classic Millennial that Cassandra, needs to pull herself up by the auto_bootstraps
Although the DB config has been blamed, we can always get config wrong. You can't make developers trawl through all the config docs, and sometimes the docs aren't very good anyway. The root cause was missing error handling: a missing row from Cassandra was ignored? If only the compiler could prevent these mistakes? https://www.rust-lang.org/
Welcome to the community, but not sure why you have gone with that point. See the comment from @dig090 above for an example of a genuinely useful and on-point suggestion.
Shouldn't you be touting this point on the Phoronix forums instead, home of the zealous campaign to replace the world with Rust?
Language choice would not have helped here, and they are already using Go, which has exception handling and runtime checking just as good as Rust's.
There is plenty of grotty C out there that could do with being replaced long before you start tackling a modern codebase.
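For what it's worth, here's roughly how a missing row surfaces in Go with the gocql driver (just an example; no idea what Monzo's own wrappers look like, and the table name is made up). Scan returns gocql.ErrNotFound as an ordinary error value that the caller has to check:

```go
// Illustrative only: with the gocql driver, a query that matches no rows
// returns gocql.ErrNotFound from Scan, an ordinary error value the caller
// has to deal with (or deliberately discard). Table name is made up.
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func lookup(session *gocql.Session, key string) (string, error) {
	var value string
	err := session.Query(
		`SELECT value FROM accounts.settings WHERE key = ?`, key,
	).Scan(&value)
	if err == gocql.ErrNotFound {
		// The "missing row" case is an explicit error, not a silent zero value.
		return "", fmt.Errorf("no row for key %q: %w", key, err)
	}
	if err != nil {
		return "", err
	}
	return value, nil
}

func main() {
	cluster := gocql.NewCluster("cassandra-1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	value, err := lookup(session, "example-key")
	if err != nil {
		log.Printf("lookup failed: %v", err)
		return
	}
	fmt.Println(value)
}
```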
Oooh, @SouthseaOne. Go is a great language! Just pointing out my perspective, and informing other users of a safer alternative. Rust doesn't even have exceptions. It requires error handling before the code will compile. Why would you replace C applications? They work great too!