Monzonaut AMA - Suhail - Staff Engineer 🛠

We’re going to keep this AMA hype going and for our second in the series we’ve got @suhailpatel who is a staff engineer here at Monzo :tada:

I could try and explain what that involves but I’d do it a complete injustice so we’ve got a little bio from the man himself below:

I’m one of the Staff Engineers at Monzo. I work in the Platform Collective, focusing on our technical infrastructure powering the bank. This includes things like our Kubernetes platform, databases and data storage, and much more…

As a member of the Staff Engineering community, I help inform and guide on building secure, reliable and robust systems. We also work on building the shape and role of technology at Monzo.

One of my first tasks was to make sure Monzo scaled well enough to have a successful crowdfunding. Thankfully, that went without a hitch and it’s been very fun helping scale Monzo’s Platform as our customer base has increased!

Before Monzo, I was working at Citymapper, building the backend to help folks get from A to B using Public Transport.

I know how you enjoy learning about the inner workings of Monzo so I feel Suhail will be able to answer those kinds of questions pretty well.

Let’s see what a staff engineer would prefer more: sneezing every time someone says hello to you or falling over every time someone says hello to you :eyes:

3 Likes

What’s been the biggest improvement you’ve helped work on here at Monzo?

Is there any aspect of working at Citymapper which you feel would be great to replicate at Monzo?

1 Like

I’m not a staff engineer, but it’s gotta be a sneeze, no?

I have questions!

What’s weirder, money or movement?

More generally, how much autonomy do you have as an individual, and how much does the Collective (feels either communist or Borg) have? How do you interact with, say, Product on making things better for the customer?

If you could magic any feature into existence for either CityMapper or Monzo, what would it be?

Do you have a staff version of Monzo and how many special flags do you have turned on? :eyes:

3 Likes

I’m going to go out there and say the falling over. It’d make for a better story (although I hope it doesn’t translate to remote calls too) :smile:

I really like when things get a little bit faster (we have a fun internal channel called #graph-trending-downwards where we celebrate things getting more efficient). Many changes I make affect our core libraries and abstractions which means any improvements affect all our services.

Citymapper was a much smaller company so company-wide sessions were much easier to coordinate. It was really fun getting folks across the organisation to rally together and build cool things like all the tech in a bus (turns out passenger counters can be quite tricky to get right).

Monzo has really great people who are all really busy building new features and capabilities and improvements which makes this a bit harder. I hope we can do it as a hack-week or similar as folks come back to an office setting. It’s a great way to meet new folks and make friends.

4 Likes

Both have their big complexities. One constant is both have many archaic systems built in the early days of tech and no capacity or appetite to change/improve them.

It makes me very glad for movements like open data for transit and open banking for financial institutions because now the data can’t be hidden away under lock and key. It’s forcing both industries to take tech capabilities seriously (it doesn’t need to be a cost center!)

Tons of autonomy as an individual. There’s no fixed day to day. My goal is to know the problems and potential problems before they become large problems. That means I do a lot of observing and understanding. I spend a lot of time speaking with engineers and product folks across the company to understand where our Platform works well and where we need improvements. I then work with others to prioritise and kick those into action.

I have less input/influence in the product direction or strategy or feature development, but I never want our platform to be a limiting factor in bringing Monzo’s ideas to life. Maintaining our shipping velocity is very important to us because ultimately that’s how things get released to customers.

Within the Collective (that’s our grouping of squads as a high level organisational unit), we have many squads encompassing data, security, core infrastructure and more. Every squad brings things from the bottom up (each individual suggesting things that we believe we should be doing) as well as collective level goals from the top down (what are our goals as a business).

I’m still waiting for the Citymapper Jetpack. One can dream…

Yes, but I’m pretty tame and keep to the stock experience (which is useful because I get the same baseline experience that all customers get)

5 Likes

I see that I’m gonna need to get more inventive to get Monzonauts to spill the secrets… :joy: :male_detective: :mag:

5 Likes

Great to see you here for an AMA @suhailpatel. I have really enjoyed a number of your videos from conferences in the past

I have seen mentions before about the complexity of managing Kubernetes and that on account of :monzo: being an early adopter it has all been a lot more hairy and labour intensive than might be desirable

What is the state of play with “managed” solutions from Amazon and Google and the like? Can you imagine a point where this might become a viable option for :monzo:?

3 Likes

Managing Kubernetes (especially at our scale) is a full time operation. When we started, managed offerings weren’t really available or tenable.

Right now, we’re working on a big project to migrate from our own self-hosted Kubernetes cluster to Elastic Kubernetes Service (EKS) by AWS. The feature set and stability are now identical and it frees up our time to work on other infrastructure improvements. We’ll be having some blog posts on this coming soon :soon:

7 Likes

What advice would you give to someone starting out within platform engineering? And do you suggest any good courses to take? :raised_hands:

4 Likes

Platform Engineering has rapidly become a really broad field. The common theme is gaining experience in figuring out how things fail or fall down.

In terms of courses, I really like

My top tip is to get a few virtual machines on a service like DigitalOcean and setup your own clusters and try out some of the tools like Terraform and Kubernetes. If you have a few Raspberry Pis lying around, those make for good Kubernetes clusters too! Practice really helps with learning.

4 Likes

Thank you for doing the AMA @suhailpatel. I have a few questions on varying subjects that I’d like to ask you.

I run a small digital agency specialising in building and running e-commerce stores, but we do have a handful of full-stack web apps where we use Go on the backend.

While we aren’t building microservices like you, I believe we have a similar set of problems when it comes to platform engineering. We have 30+ applications across all sorts of platforms that we need to build, maintain, deploy and monitor which is becoming a problem as we scale our team, and the number of projects we are working on. I want to get to a place where all our developers are using the same platform so we can standardise tooling and monitoring regardless of the technology and customer, but it’s a hard place to reach with such a small team and so little time.

We use Kubernetes, FluxCD, Prometheus and Grafana, for a few projects, so we are well on our way to being able to produce this platform. How would you go about working out what problems to tackle first? What would the ideal platform and accompanying tooling look like to you if you had to start again?

The following questions may be more engineering focused, but I would appreciate some insight if you feel you can give it.

From the talks you’ve all put out, it seems you favour building smaller services as components. Do you have situations where you have to mutate state across multiple services synchronously as part of an operation, and if so, how do you manage failures partway through without transactions? Are these just handled on a case-by-case basis?

Another potential problem with the state being across multiple services is that you don’t get the ability to use JOIN queries. A practical Monzo example of this may be listing the bank accounts I’m a member of. Do you have all the state for accounts and my memberships in service.account, making that a more complex service, or do you tend to split it down into an additional service.account-member at the expense of more complex RPC to obtain the list of my accounts? How do you tend to solve for the lack of joins for relational data such as this?

Thanks,
Nick.

1 Like

Today is the last day to get questions into @suhailpatel :pray:t3:

1 Like

These are some really great questions. My response to this is going to be a personal response (what I think) vs necessarily what we might do at Monzo.

I don’t think there is a perfect ideal. Within your org, you should have a sense of what is causing you the most pain. Focus your efforts on that. You’ve described a nice set of well understood components there. They are extremely powerful. My recommendation is don’t expose all of it to end users without constraints.

It’s easy to get in a state where you are supporting twenty different uses of Kubernetes which are all subtly different across teams. Having a principle of a paved path and some purposeful friction (ie: come talk to us) on deviating from that paved path is a great way of giving freedom whilst staying in control.

We try hard to avoid cases where state cascades across components. Partial failures are really tricky to resolve (especially since they may involve third parties who might not have full rollback capabilities).

There are a few patterns that we heavily practice:

  • Idempotency: If an action happens multiple times, the effects should only be applied once. This is especially important because at a Platform level, we do retries
  • Reconciliation: Often, the problem isn’t the inconsistency itself, but the fact that it’s never discovered. Building systems that identify the inconsistency and either alert or try to remediate helps significantly here
  • Backfill: If you introduce a new constraint that you want to enforce, you need the ability to iterate through all pre-existing data and set the relevant data. That means you have a clean set of data for new constraints which makes enforcement whole rather than partial
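To make the first two patterns a bit more concrete, here is a minimal sketch in Go. It is only an illustration under stated assumptions, not how any of this is built at Monzo: `Ledger`, `Credit` and `Mismatches` are hypothetical names, the store is an in-memory map, and a real system would persist the idempotency keys alongside the data they protect.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Ledger is a hypothetical in-memory store, used purely for illustration.
type Ledger struct {
	mu      sync.Mutex
	balance int64
	applied map[string]bool // idempotency keys that have already been handled
}

func NewLedger() *Ledger {
	return &Ledger{applied: make(map[string]bool)}
}

// Credit applies amount at most once per idempotency key.
// It reports whether this call actually changed the balance.
func (l *Ledger) Credit(key string, amount int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.applied[key] {
		return false // a retry of a request we have already applied
	}
	l.applied[key] = true
	l.balance += amount
	return true
}

// Balance returns the current balance.
func (l *Ledger) Balance() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.balance
}

// Mismatches is a tiny reconciliation pass: it returns, in sorted order, the
// IDs whose value differs between two independently maintained views, or that
// exist in only one of them. A real job would alert on or repair these.
func Mismatches(a, b map[string]int64) []string {
	seen := make(map[string]bool, len(a))
	var out []string
	for id, av := range a {
		seen[id] = true
		if bv, ok := b[id]; !ok || bv != av {
			out = append(out, id)
		}
	}
	for id := range b {
		if !seen[id] {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	l := NewLedger()
	l.Credit("req-1", 500)
	l.Credit("req-1", 500) // simulated platform-level retry: ignored
	fmt.Println(l.Balance()) // 500

	ours := map[string]int64{"acc_1": 500, "acc_2": 100}
	theirs := map[string]int64{"acc_1": 500, "acc_2": 90, "acc_3": 10}
	fmt.Println(Mismatches(ours, theirs)) // [acc_2 acc_3]
}
```

The point of the key is that the caller (or the platform doing the retry) can resend the same request safely; the reconciliation pass exists so that any inconsistency that still slips through gets noticed.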

In the example you’ve given about handling accounts: accounts are a wide-reaching concept at Monzo (think what happens when you want to do joint accounts and business accounts and US accounts etc). If we handled all of it transactionally, the complexities would be quite large.

There’s some more complexity and overhead in the RPCs to meld all of this together, but I argue that’s easier to reason about vs untangling at the data layer. We have a lot of RPCing in our backend, it has a compute cost but is a massive key enabler for new features.
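As a rough sketch of what "melding it together over RPC" can look like, here is a join done at the application layer in Go. Everything here is an assumption made for illustration: `MembershipReader` and `AccountReader` stand in for RPC clients to two separate services, and the fakes replace real network calls. It is not a description of Monzo's actual services or APIs.

```go
package main

import "fmt"

// Account is a hypothetical, heavily simplified account record.
type Account struct {
	ID   string
	Name string
}

// MembershipReader stands in for an RPC client to a membership service.
type MembershipReader interface {
	AccountIDsForUser(userID string) ([]string, error)
}

// AccountReader stands in for an RPC client to an account service.
// Reading in a batch avoids one round trip per account.
type AccountReader interface {
	ReadAccounts(ids []string) (map[string]Account, error)
}

// ListAccountsForUser performs the "join" in application code:
// first resolve the user's account IDs, then fetch those accounts in one call.
func ListAccountsForUser(userID string, m MembershipReader, a AccountReader) ([]Account, error) {
	ids, err := m.AccountIDsForUser(userID)
	if err != nil {
		return nil, fmt.Errorf("listing memberships: %w", err)
	}
	if len(ids) == 0 {
		return nil, nil
	}
	byID, err := a.ReadAccounts(ids)
	if err != nil {
		return nil, fmt.Errorf("reading accounts: %w", err)
	}
	out := make([]Account, 0, len(ids))
	for _, id := range ids {
		// Skip memberships that point at an account the other service
		// does not know about; a reconciliation job could flag these.
		if acc, ok := byID[id]; ok {
			out = append(out, acc)
		}
	}
	return out, nil
}

// In-memory fakes so the sketch runs without any network.
type fakeMembers map[string][]string

func (f fakeMembers) AccountIDsForUser(userID string) ([]string, error) {
	return f[userID], nil
}

type fakeAccounts map[string]Account

func (f fakeAccounts) ReadAccounts(ids []string) (map[string]Account, error) {
	out := make(map[string]Account, len(ids))
	for _, id := range ids {
		if acc, ok := f[id]; ok {
			out[id] = acc
		}
	}
	return out, nil
}

func main() {
	members := fakeMembers{"user_1": {"acc_1", "acc_2"}}
	accounts := fakeAccounts{
		"acc_1": {ID: "acc_1", Name: "Personal"},
		"acc_2": {ID: "acc_2", Name: "Joint"},
	}
	list, err := ListAccountsForUser("user_1", members, accounts)
	if err != nil {
		panic(err)
	}
	for _, acc := range list {
		fmt.Println(acc.ID, acc.Name)
	}
}
```

Each service keeps ownership of its own data, and the cost is the extra round trip plus handling the case where the two views disagree.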

2 Likes

Better get a more personal question than my one about Kubernetes in then!

I work on a data network product based around a web of interconnected nodes

All written using C++ and running under a mix of enterprise Linux and a real time operating system

We mostly use SNMP/MIB to monitor these nodes, but that has its limitations

I did have a little bit of success getting Prometheus running under the Linux based nodes a few years back, but that was a bit of hobby coding of my own and we’re in the embedded space, so it never made it to the wild

Ideally I would like something like vizceral, that shows the network flows clearly, but it has been discontinued and is not particularly suitable for embedded/C++ systems anyway:

Any suggestions? Thanks!

1 Like

Thank you @suhailpatel, that is very helpful.

I was hoping to ask a further few questions (my apologies @AlanDoe), about a hypothetical scenario on how to compose services. I’m not sure if this is related to platform engineering, please don’t answer if not.

Let’s say we have service.account and service.account-member where accounts can have many members. How would you go about composing these services to allow for “listing all the accounts for a given user”? This query is interesting because it requires state from both services to provide a list of results. It could also require sorting and pagination (perhaps not in this exact example) which would be tricky to do by RPCing both services.

An obvious solution might be to just combine the services, but this feels contrary to single responsibility microservices.

Would an acceptable solution be to use an asynchronous process to keep service.account apprised of its members, by storing a (partial) copy of members within its own database/keyspace? In the given scenario, all it would take to obtain the list is one RPC to service.account. Is this a particular pattern you may use?
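To illustrate the idea being asked about, here is a minimal Go sketch of a consumer that keeps a partial copy of memberships next to the account data. It is hypothetical throughout: `MemberEvent`, `MemberIndex` and the "added"/"removed" kinds are made-up names, the index lives in memory, and the sketch assumes events for a given account and user arrive in order. Applying an event twice has no extra effect, so redelivery is safe.

```go
package main

import (
	"fmt"
	"sort"
)

// MemberEvent is a hypothetical event emitted when a membership changes.
type MemberEvent struct {
	Kind      string // "added" or "removed"
	AccountID string
	UserID    string
}

// MemberIndex is a partial, denormalised copy of memberships,
// kept up to date by consuming events asynchronously.
type MemberIndex struct {
	byUser map[string]map[string]struct{} // user ID -> set of account IDs
}

func NewMemberIndex() *MemberIndex {
	return &MemberIndex{byUser: make(map[string]map[string]struct{})}
}

// Apply folds one event into the index. Set semantics make it idempotent:
// a redelivered event leaves the index unchanged.
func (ix *MemberIndex) Apply(ev MemberEvent) {
	switch ev.Kind {
	case "added":
		set := ix.byUser[ev.UserID]
		if set == nil {
			set = make(map[string]struct{})
			ix.byUser[ev.UserID] = set
		}
		set[ev.AccountID] = struct{}{}
	case "removed":
		// Deleting from a missing (nil) set is a no-op in Go.
		delete(ix.byUser[ev.UserID], ev.AccountID)
	}
}

// AccountIDs answers "which accounts is this user a member of?" from the
// local copy, sorted so the result can be paginated deterministically.
func (ix *MemberIndex) AccountIDs(userID string) []string {
	ids := make([]string, 0, len(ix.byUser[userID]))
	for id := range ix.byUser[userID] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	ix := NewMemberIndex()
	events := []MemberEvent{
		{Kind: "added", AccountID: "acc_2", UserID: "user_1"},
		{Kind: "added", AccountID: "acc_1", UserID: "user_1"},
		{Kind: "added", AccountID: "acc_1", UserID: "user_1"}, // redelivery
		{Kind: "removed", AccountID: "acc_2", UserID: "user_1"},
	}
	for _, ev := range events {
		ix.Apply(ev)
	}
	fmt.Println(ix.AccountIDs("user_1")) // [acc_1]
}
```

The trade-off with this shape is that the copy is eventually consistent, so it would typically be paired with a way to detect and repair drift from the source of truth.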

I’d be really interested to hear more about how Monzo go about composing services and the patterns you commonly use, but maybe that’s better served by a blog post.

Thanks again - I really appreciate your time!