Absolutely superb blog post - honest and detailed but very clearly described too. The comparison with the likes of the TSB IT issues last year is stark!
The first question that came to my mind, as a DBA myself, was: if you are using a cluster, why not have a failover cluster for the database to avoid this kind of issue to begin with? I applaud your root-cause investigation and write-up; I seldom see that kind of work these days. But sometimes we have to look at our original designs to see if and where we could have done better. Great job on transparency too.
Good stuff. Appreciate the detail. You mentioned the Cassandra quorum. Is it possible to monitor quorum disagreements? That telemetry certainly would have told you there was a problem with Cassandra when there was an uptick in disagreements between the read replicas. It would also have shown up in testing when you added a single new server.
Also, it might be interesting to have automations that do general reads from Cassandra for random keys. You wouldn't want to put serious load on the backend, but this might help discover Cassandra problems faster than waiting for upstream services to fail, which probably lags. Not sure how you would determine a random key or how to gauge the frequency of the testing.
Hey Gray
Welcome to the community
That sounds like a very good suggestion
Cannot say whether that's a stat they can actually expose, but they did mention finding more aspects to monitor for signs of issues
This is a great suggestion, and something we've already picked up.
Since the incident we've updated our internal Cassandra health check service to write several hundred keys with known values, and attempt to read them back. We then expose the results as metrics for our monitoring system to pick up.
We tested scaling up a test cluster with the same method we used on July 29th, and this check immediately caught the data inconsistencies. We now run this service against all of our clusters and it's hooked up to a paging alert for our on-callers.
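To give a rough idea of the shape of it, here is a minimal sketch (illustrative only, not our actual service; the keyspace, table and metric names are placeholders): write a batch of keys with known values at quorum, read them straight back at quorum, and export the mismatch count for the monitoring system to alert on.

```go
// Illustrative health-check sketch: write keys with known values at QUORUM,
// read them back at QUORUM, and expose the number of mismatches as a metric.
// Keyspace, table and metric names are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gocql/gocql"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var mismatchedKeys = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "cassandra_healthcheck_mismatched_keys",
	Help: "Keys whose read-back value did not match what was just written.",
})

func runCheck(session *gocql.Session, numKeys int) {
	bad := 0
	for i := 0; i < numKeys; i++ {
		key := fmt.Sprintf("healthcheck-%d", i)
		want := fmt.Sprintf("value-%d-%d", i, time.Now().UnixNano())

		// Write the known value at quorum.
		if err := session.Query(
			`INSERT INTO healthcheck.probe (key, value) VALUES (?, ?)`,
			key, want,
		).Consistency(gocql.Quorum).Exec(); err != nil {
			bad++
			continue
		}

		// Read it straight back at quorum and compare.
		var got string
		err := session.Query(
			`SELECT value FROM healthcheck.probe WHERE key = ?`, key,
		).Consistency(gocql.Quorum).Scan(&got)
		if err != nil || got != want {
			bad++ // missing row or stale value: count it, never ignore it
		}
	}
	mismatchedKeys.Set(float64(bad))
}

func main() {
	cluster := gocql.NewCluster("cassandra-1", "cassandra-2", "cassandra-3")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	prometheus.MustRegister(mismatchedKeys)
	go func() {
		for {
			runCheck(session, 300) // "several hundred keys"
			time.Sleep(30 * time.Second)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```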
This is amazing…
It's so refreshing to understand and see what Monzo did behind the scenes
Classic Millennial that Cassandra, needs to pull herself up by the auto_bootstraps
Although the DB config has been blamed, we can always get config wrong. You can't make developers trawl through all the config docs, and sometimes the docs aren't very good anyway. The root cause was missing error handling: a missing row from Cassandra was ignored? If only the compiler could prevent these mistakes? https://www.rust-lang.org/
Welcome to the community, but not sure why you have gone with that point. See the comment from @dig090 above for an example of a genuinely useful and on-point suggestion.
Shouldn't you be touting this point on the Phoronix forums instead, home of the zealous campaign to replace the world with Rust?
Language choice would not have helped here, and they are already using Go, which has exception handling and runtime checking just as good as Rust's.
There is plenty of grotty C out there that could do with being replaced long before you start tackling a modern codebase.
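For what it's worth, here's roughly how a missing row surfaces in Go with the gocql driver (just an example; no idea what Monzo's own wrappers look like, and the table name is made up). Scan returns gocql.ErrNotFound as an ordinary error value that the caller has to check:

```go
// Illustrative only: with the gocql driver, a query that matches no rows
// returns gocql.ErrNotFound from Scan, an ordinary error value the caller
// has to deal with (or deliberately discard). Table name is made up.
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func lookup(session *gocql.Session, key string) (string, error) {
	var value string
	err := session.Query(
		`SELECT value FROM accounts.settings WHERE key = ?`, key,
	).Scan(&value)
	if err == gocql.ErrNotFound {
		// The "missing row" case is an explicit error, not a silent zero value.
		return "", fmt.Errorf("no row for key %q: %w", key, err)
	}
	if err != nil {
		return "", err
	}
	return value, nil
}

func main() {
	cluster := gocql.NewCluster("cassandra-1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	value, err := lookup(session, "example-key")
	if err != nil {
		log.Printf("lookup failed: %v", err)
		return
	}
	fmt.Println(value)
}
```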
Oooh, @SouthseaOne. Go is a great language! Just pointing out my perspective, and informing other users of a safer alternative. Rust doesn't even have exceptions. It requires error handling before the code will compile. Why would you replace C applications? They work great too!