Monzo Staff Weekly Q&A - Oliver Beattie (Head of Engineering) - Reliability Report Special!

simonb · 25 July 2018 13:52

Note : All Monzo Q&As to date can be found here


Readers, Revellers, and Rebels, let’s gather round to reason and reflect, for the time is right to return.


It’s the Monzo Weekly Staff Q&A!!!

We’ve just published our Reliability Report - go check it out if you haven’t already. As you may know, we believe in being radically transparent. Not just when things go right, but also when there are challenges. These might include service interruptions and/or downtime. These things happen, but we don’t believe you should be kept in the dark! When there is an issue that might affect customers, it’s our responsibility to be clear, informative and helpful in our communication, keep you in the loop and let you know what we’re doing to fix any situation.

That said, it’s important that keeping you in the loop doesn’t create a perception that we suffer from any more service interruptions or outages than anyone else, because we simply don’t! That’s the risk - in being communicative, you may unknowingly create a false narrative. In fact, we believe our platform is inherently more stable due to the way we’ve built it.

Hence, the Reliability Report! With this happening, it was only right to call on this week’s guest to do Q&A.

BUT FIRST!


Catch up with all previous Q&As below!

Week 1 : Chris MacLean, Customer Operations & Vulnerable Customers
Week 2 : James Nicholson, iOS Engineer
Week 3 : Tara Mansfield, People Operations Manager
Week 4 : James Routley, Backend Engineer
Week 5 : Hugh Wells, Customer Operations
Week 6 : Naz Malik, Technical Specialist
Week 7 : Fred Morgan, COps Squad Captain (Calls & Social Media)
Week 8 : Emma Northcott, COps Scaling Team
Week 9 : Jarno Wolf, COps Financial Crime Specialist & Squad Captain
Week 10 : Maria Campbell, Head of People
Week 11 : Jim Amey, Night COps Captain
Week 12 : Richard Cook, Online Community Manager
Week 13 : Beatrice Borbon, Content & Press Manager
Week 14 : Tom Blomfield, CEO
Week 15 : Ella Johanny, COps/Hiring
Week 16 : Harry Ashbridge, Writer
Week 17 : Beth Scott, Overnight COps
Week 18 : Georgie Parmenter, Executive Assistant to the Founders
Week 19 : Vulnerable Customers Team
Week 20 : Leah Templeman, Interim VP People
Week 21 : Daniel Chatfield, Backend Engineer, Fincrime & Security
Week 22 : Valerio Magliulo, Product Manager - Revenue Team
Week 23 : Sam Watkin, Operations Analyst
Week 24 : Kieran McHugh, Backend Engineer
Week 25 : Jonas Huckestein, Co-Founder and Chief Technical Officer
Week 26 : Annual Report Edition with Tristan Thomas and Julie Oey
Week 27 : Zander Brade, Lead Product Designer
Week 28 : Richard Dingwall, Payments Engineer

This week in the Hot Coral Hot Seat™️ we’ve got our Head of Engineering, Oliver Beattie!!!

Oliver has been here at Monzo since June 2015. He’s currently working with our COps Scaling team to make some big improvements to how we do customer support at scale

Fun Fact About Oliver
His first job was in Hawaii!

His favourite thing about working for Monzo?
“There are so many domains of knowledge that I used to think sounded super dull, but I’ve been able to discover at Monzo that that isn’t the case and they’re actually fascinating. Financial crime, finance/accounting, and customer service, for instance.”

You know how it goes - get your questions in, and Oliver will be here tomorrow to answer them! We don’t have as much time as usual this week to get questions in, so get 'em in quick!

Slicknade · 25 July 2018 14:09

I’ll start this one off!

I’ve seen mentioned that staff use Slack to help them with their day to day…

I’m always on the lookout for how software can make my life easier (hence my love of Monzo), and since discovering Slack and Trello I’ve found myself much more organised.

Do you have any other hidden gems worth looking at? And how do they help you and your team?

anon95933191 · 25 July 2018 14:17

If you could travel back in time and give 2015 Oliver one piece of advice about the road ahead what would it be?

jwhiterz · 25 July 2018 14:29

What was the hardest aspect of the current account system to get into a reliable state?

Dannytc · 25 July 2018 14:38

Obviously critical, Pineapple Pizza (Did you eat many of these in Hawaii?!) & Cats?

Pure Trivial, 600 services, is this to maintain performance, or does completing a single bank transfer for example involve a large number of different services being called for the single task?

jwhiterz · 25 July 2018 14:45

What challenges (if any) does the Monzo architecture face that wouldn’t be apparent with traditional bank architecture?

What positives (if any) does the Monzo architecture face that wouldn’t be apparent with traditional bank architecture?

Rat_au_van · 25 July 2018 14:48

The majority of the recent minor issues seem to have been bank transfer related. Apart from the big Faster Payment outage what has been the cause of the others? Is it a Monzo issue or someone else?

anon58623005 · 25 July 2018 18:06

Does anything in particular give you sleepless nights when it comes to reliability, such as impacting a VIP with large influence, impacting large volumes of customers in general or a smaller number but with a long downtime?

ChrisBeldam · 26 July 2018 07:05

Do you think factors beyond your control, e.g. Mastercard outages, make it harder to keep Monzo’s reputation for reliability? Do you think that posting on Twitter that it’s a general Mastercard or Faster Payments issue and not Monzo-specific sufficiently shows it’s not Monzo being unreliable but just that an external system has gone down?

joshpriceonline · 26 July 2018 09:30

Hi Oliver,

To kick things off, your talk at KubeCon was awesome!!! It was great to hear you share your experience of an outage and ultimately the lessons that were learnt.

At the moment the majority of Monzo’s infrastructure is based on AWS. Do you foresee using any other cloud providers in the future to aid reliability? If so, have you designed the back-end in a way where you could host across multiple cloud providers?

Do you think Monzo could be more transparent behind the reasons for outages? I’m more referring here to the postmortems which we’ve seen in the past but for some of the more recent outages we haven’t.

As Monzo starts to expand to other countries in the next few years, what do you think the biggest challenges will be from an engineering perspective?

anon95933191 · 26 July 2018 13:11

Throwing in a sneaky second question. Google’s new Go cloud library has me very excited. I am curious: what are your opinions?

anon9133846 · 26 July 2018 13:13

It’s been mentioned already, but the Mastercard outage affected Monzo and other banks too. Can you shed some light as to what this outage was and how Mastercard are ensuring it doesn’t happen again?

Also, I’ve spoken to a number of people who would love to give Monzo a try and potentially go full Monzo, but the constant issues are really holding them back especially given that there was promise that all the issues people faced would be resolved once CA is out in full.

anon93413334 · 26 July 2018 20:45

You use metrics to track outages and issues and you mention categorising incidents according to assumed customer impact - do you use any other metrics on customer interactions integrated with this data to get an holistic view of how an incident has affected customers in the past? More generally, how do you use data to help your day-to-day role?

anon77247897 · 26 July 2018 23:37

Hi everyone, excited to be doing Q&A for the first time. Reading that I’ve been here since 2015 above was a little mindblowing, even for me. Time flies!

We have a vast amount of internal knowledge. It’s important that it’s accessible to everyone at the company, especially since we’re growing so quickly. An internal wiki is a way many companies will address this, but we’ve found most wiki software very cumbersome to the point that people avoid using it.

Notion has recently started to fill this gap for us nicely – and people seem to love using it!

Teaching humans to do the things you can gives you enormous leverage, just like writing software.

All of them! But more seriously, and perhaps unsurprisingly, I think our Mastercard systems have been the most tricky. Card schemes are huge, global networks with a lot of inherent complexity and a lot of features that have evolved over decades. In many cases, it can be hard to find anyone who understands how aspects of them work! We have some fantastic people working on these who I would now consider to be world experts, but we’re still finding new edge cases and better ways of doing things.

Pineapple on pizza isn’t right. Sorry Jonas

Our choice of a microservices architecture wasn’t really driven by performance considerations: a network RPC call is never going to be as fast as a local function call. It really excels for us because it lets independent teams work on different areas of the codebase without a lot of co-ordination overhead. It certainly comes with costs, but we are really reaping the benefits now we have a team of engineers more than 100-strong.

As Monzo continues to mature, I get more and more confident that we can handle Really Bad™️ situations.

We’ve made a really deliberate effort to improve reliability over the last 12 months, and I think the data speak for themselves that it’s working. We’ve put a lot of effort into making issues less frequent, but just as importantly we’ve improved our ability to deal with issues swiftly and effectively when they (inevitably) do happen. I’m really proud of all the people who are making this happen.

I hear you. Our communications during outages need to get clearer, and we’re working hard to make that happen

There are some situations where we can’t share as much information as we’d like to, and they frustrate us too. However, there is a lot of room for improvement within the bounds of what we can do. We’re looking at putting information in the most relevant places within our apps when specific features are having problems. And sharing whether a problem is likely to just affect Monzo or whether other banks might be impacted as well.

I also hope that by sharing our internal data over the last year, and committing to do this on an ongoing basis, we help people see that our reliability really is improving and is comparable to that of other banks.

Connecting to more and more local payment networks around the world will be a lot of work and will put a lot of new demands on our platform.

I’ve also recently started working with COps; I’m learning there are a lot of really fascinating technical challenges that come with delivering the world’s best customer support at a large scale.

I haven’t had chance to look at it a great deal yet: for the most part our services are fairly agnostic of the cloud infrastructure they’re running on already. There does seem to be a fair amount of buzz around the library though, and I found this Tweetstorm by Rob Pike interesting:

https://twitter.com/rob_pike/status/1021913256204460032
