RESOLVED: Current account payments may fail - Major Outage (27/10/2017)

Oliver said they do this to test new features, not find bugs in code. I imagine they debug the code thoroughly before deploying to users. But, for instance, it allows them to quickly release different versions of an overdraft interface for a few groups of test users, get feedback, and refine the model. I think it’s more about testing how features/interfaces work than debugging code. At least that’s how I read it, could be wrong. :man_shrugging:


None of this message makes the slightest sense


They have a staging environment - and new features will be tested in a sandbox before they ever hit a live system. I would imagine they are then tested internally whilst running on “production” infrastructure (because you can’t use MasterCard’s network in a sandbox…) before being released to customers.

Important thing to remember: Monzo is in beta.


I’m happy you are not developing software for me then

Like I said several times, we can agree to disagree.

That’s what I hope they use - I was querying @oliver on his statement

And being in beta is not an excuse for doing things wrong. There is real money at stake. Beta is for deciding what features they have, not for releasing buggy software.


Dave, I think you misunderstood here: this is testing new customer-facing features to see whether customers like them, NOT testing whether the features are technically functional. Huge difference there. They clearly have a staging environment, a release process, and sophisticated rollout/rollback powers; Kubernetes provides excellent tooling for this.

This particular bug seems like a perfect storm of problems that can be hard, if not impossible, to find in testing environments. Kubernetes is a new technology and these corner cases take time to sort out. But it has the backing of pretty much every big player in the cloud, including Google and Amazon. These issues are certain to be fixed in short order, and Monzo will be better for it. New technology comes with risk, but the rewards can be huge. This is why the current account is still in an early testing phase.


“Test” in this context means to release a feature to users (or a subset of them) to see how they like it. It does not mean that we have no idea whether the code works until we promote it to production: we have automated tests and staging environments for that which I referenced in the post-mortem.
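The kind of user-subset release described above is commonly implemented by deterministically bucketing a stable user identifier. Here is a minimal sketch of that general technique (the function name, feature key, and thresholds are illustrative assumptions, not Monzo's actual mechanism):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Place a user into one of 100 buckets and include the first `percent`.

    Hashing the feature name together with the user id keeps each
    feature's rollout population independent of other features', and
    the same user always gets the same answer for a given feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Deterministic: the user's experience is stable across requests.
assert in_rollout("user-42", "new-overdraft-ui", 10) == in_rollout("user-42", "new-overdraft-ui", 10)
# Everyone is included at 100%, nobody at 0%.
assert in_rollout("user-42", "new-overdraft-ui", 100)
assert not in_rollout("user-42", "new-overdraft-ui", 0)
```

If feedback on the subset is good, the percentage is raised; if not, it drops back to zero without a redeploy.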


Thanks, that’s reassuring.

But still, your inability to roll back should be of major concern.


(NB: this didn’t work because, as Oliver explains, the issue was not with the actual code/services but with a system incompatibility.)


The rollback failed - they ended up having to roll forward - this would have been a huge problem if there hadn’t been a version to roll forward to.

In the end this is a failure of the testing and deployment of new features. If the staging environment had accurately replicated the real one, this bug would have been found there.

I would agree with this - it is often very tempting to upgrade the environment your code runs on, and it is very important to test that the environment actually works with your configuration before it goes out to production.
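One common way to reduce that kind of staging/production drift is to pin platform versions in a single shared place, so both environments run the identical stack and any upgrade is exercised in staging first. A hypothetical config sketch (the layout, tool names, and version numbers here are made up for illustration and are not Monzo's actual setup):

```yaml
# Shared platform pins: staging and production both read from here,
# so an upgrade cannot reach production without first running in staging.
platform_versions:
  kubernetes: "1.7.9"   # illustrative versions only
  etcd: "3.2.9"
  linkerd: "1.3.0"

environments:
  staging:
    versions_from: platform_versions   # identical stack, smaller scale
  production:
    versions_from: platform_versions
```

The point is not the exact format but the invariant: there is no way to express "production runs a different platform version than staging" in the config at all.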

Great write-up @oliver. The transparency you are providing is great.

As someone who has been using Docker extensively for the last 12 months and is just starting to dip a toe into Kubernetes, it is very interesting to read about how you use these services, and even more so to read about how you debugged the issue.


While I agree that more testing could have found this, it is not at all clear it would have without a cripplingly monumental effort. The whole point of something like Kubernetes and isolated services is to allow fast, frequent, and somewhat independent deployment by different teams across the company. It is one of the core reasons Monzo built their infrastructure the way they did.

These things happen even to the best tested and best run systems in the world. They especially happen when the system is new. There are many lessons Monzo can learn from an outage like this and from the sounds of it they are learning those lessons. But you are casting it in a particularly negative light under some faulty assumptions.


This seems like the sort of problem that would (and regularly does) floor large multinational companies. Considering how much went wrong and the small team you have, I’m amazed that Monzo managed to get back online so quickly.


What light you read it in is your problem, not mine. Perhaps you can point out the faulty assumptions so we can all learn?

My only assumption is that this was a failure to test adequately. That is self-evident. As to it being a monumental effort, would you accept that as a reason for not testing when your money disappears in a hack? If the testing methodology doesn’t find a bug, you update the methodology.

It really doesn’t matter what magical technology is used, ultimately it is a bank and it holds my money. If allowing for ‘frequent and independent deployment from different teams’ causes uncatchable bugs, then that methodology is flawed.

Just a gentle reminder that we should respond to the content, rather than the perceived tone, of posts please :pray:


That’s what I believe the FSCS protection is for.


Are you really suggesting Monzo don’t need a methodology to catch bugs because we are protected by the FSCS?

I don’t think anyone is trying to pretend that we didn’t screw up. We did, and we’re truly sorry. We will learn from these mistakes to ensure this class of problem can’t happen again, as I hope the post-mortem demonstrates. :blush: I fully agree that better forms of testing are one of the ways we need to improve.

To be very clear, nobody’s money has “disappeared.”