Nationwide had a major outage yesterday that made the news (but not their service status page).
Don't get flattered. I just opened the topic, read a couple of comments and saw yours. Since I work in banking, I thought I'd set the facts straight.
This is a good read/study on the cascading effects that failures in a compound distributed system, spanning different technologies, can have on a critical system. I guess "failure is not an option" is really not an option anymore. Good luck to the Monzo team on strengthening the technology and processes.
@oliver Great article! This is helpful for researchers and operations engineers.
I would like to add one small point. An environment change procedure (Kubernetes/etcd/linkerd) could include an additional step to verify the actual functionality of the component. For example, after an etcd upgrade, you could deploy or scale up a service and check that Kubernetes, etcd and linkerd all reflect the change.
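A rough sketch of what such a verification step might look like, in Python. Everything here is hypothetical: `verify_change_propagates`, `apply_change` and the `observers` checks are illustrative names, not part of any Kubernetes, etcd or linkerd tooling, and the real checks would have to be written against whatever your platform exposes.

```python
import time
from typing import Callable, Dict


def verify_change_propagates(
    apply_change: Callable[[], None],
    observers: Dict[str, Callable[[], bool]],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
) -> Dict[str, bool]:
    """Apply a change, then poll each observer until it sees the change or time runs out.

    apply_change: hypothetical callable that performs the change
                  (e.g. scales up a test service).
    observers:    hypothetical checks, one per component, each returning
                  True once that component reflects the change.
    Returns a mapping of component name to whether it saw the change.
    """
    apply_change()
    seen = {name: False for name in observers}
    deadline = time.monotonic() + timeout_s
    while True:
        # Only re-check components that have not yet confirmed the change.
        for name, check in observers.items():
            if not seen[name]:
                seen[name] = bool(check())
        if all(seen.values()) or time.monotonic() >= deadline:
            return seen
        time.sleep(interval_s)
```

The idea is that the upgrade only counts as done when every component in the result is `True`. If any component never reflects the change within the timeout, that is a signal to stop and investigate before moving on.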
This issue has now been resolved and discussed in depth. Monzo staff will re-open this thread if there's more to be reported.