Monzonaut AMA - Theo - Borrowing Data Scientist

Ohhhh - my favourite topic and I can answer a whole bunch in one go - this may be a long one.

The thing I’d take with me is Monzo’s data infrastructure setup - for someone who loves data it’s honestly unreal!

Monzo runs on AWS and uses a key-value database called Cassandra. This is a NoSQL database and sucks for performing analytics. So almost any time anything happens, “an event” is emitted with all the information you could want about that event. That could be you taking out a loan, with data about the loan_id, interest rate, loan value, repayment date; it could be making a transaction, with the transaction_id, payment scheme, amount, etc.; or it could be opening the app, sending a message in chat - the list goes on. All events have a timestamp and the user that performed the action. These events are then stored in Google Cloud Platform.
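To make that concrete, here’s a rough sketch in Python of what one of those event payloads might look like. The field names and envelope structure are my own guesses for illustration, not Monzo’s actual schema:

```python
import json
import time
import uuid

def make_event(event_type: str, user_id: str, payload: dict) -> dict:
    """Build a hypothetical analytics event envelope.

    Every event gets a type, a unique id, the user who performed the
    action, and a timestamp; the payload carries event-specific fields.
    """
    return {
        "event_type": event_type,       # e.g. "loan_created"
        "event_id": str(uuid.uuid4()),  # unique per event
        "user_id": user_id,             # who performed the action
        "timestamp": time.time(),       # when it happened
        **payload,                      # event-specific fields
    }

event = make_event("loan_created", "user_123", {
    "loan_id": "loan_456",
    "interest_rate": 0.19,
    "loan_value": 5000,
})
print(json.dumps(event, indent=2))
```

The envelope/payload split is the useful bit: every event is self-describing and queryable on its own, without needing to join back to the production database.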

So we effectively have a complete list of everything that has happened in the past - but it’s unstructured - i.e. you can count how many loans we have (by doing select count(*) from loan_created_event) but you can’t see what has happened to those loans over time or perform much detailed analysis. So part of my job in the data team is to “stitch” these events together into a structured “data model” (or table) that’s useful for performing analysis. We do this using SQL (BigQuery) and a dbt engineering workflow.
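As a toy illustration of what that “stitching” means, here’s a Python sketch that folds a stream of loan events into one row per loan. In reality this is done in SQL on BigQuery, and the event names and fields here are invented for the example:

```python
# Hypothetical raw events, in no particular guaranteed order.
events = [
    {"type": "loan_created", "loan_id": "A", "loan_value": 5000, "ts": 1},
    {"type": "loan_payment_made", "loan_id": "A", "amount": 200, "ts": 2},
    {"type": "loan_payment_made", "loan_id": "A", "amount": 300, "ts": 3},
]

# Replay events in time order, building one structured row per loan.
loans = {}
for e in sorted(events, key=lambda e: e["ts"]):
    if e["type"] == "loan_created":
        loans[e["loan_id"]] = {"loan_id": e["loan_id"], "balance": e["loan_value"]}
    elif e["type"] == "loan_payment_made":
        loans[e["loan_id"]]["balance"] -= e["amount"]

print(loans["A"]["balance"])  # 4500
```

The point is that the raw events only tell you *that* things happened; the stitched table tells you the *current state* of each loan, which is what analysis actually needs.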

As an example, one of those core models is loan_stats - this has one row for every loan Monzo has ever made, with its position (balance, arrears status, etc.) at the end of every day since it was created. I checked and there are about 50 events (such as loan_created, loan_schedule_updated, loan_payment_made) that all feed into this, and about 40 SQL queries that run one after the other (dbt does this scheduling for us) to create loan_stats.
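A minimal sketch of the loan_stats idea - one row per loan per day, with its end-of-day position - might look like this in Python. The event names are borrowed from above, but the fields and logic are invented for illustration:

```python
from datetime import date, timedelta

# Hypothetical events feeding the model.
events = [
    {"type": "loan_created", "loan_id": "A", "value": 1000, "day": date(2021, 1, 1)},
    {"type": "loan_payment_made", "loan_id": "A", "amount": 100, "day": date(2021, 1, 3)},
]

def loan_stats(events, as_of: date):
    """One row per loan per day from creation to as_of, with end-of-day balance."""
    rows = []
    created = {e["loan_id"]: e for e in events if e["type"] == "loan_created"}
    for loan_id, c in created.items():
        day = c["day"]
        while day <= as_of:
            # Sum all payments made on or before this day.
            paid = sum(e["amount"] for e in events
                       if e["type"] == "loan_payment_made"
                       and e["loan_id"] == loan_id and e["day"] <= day)
            rows.append({"loan_id": loan_id, "date": day, "balance": c["value"] - paid})
            day += timedelta(days=1)
    return rows

rows = loan_stats(events, date(2021, 1, 4))
# 4 rows for loan A: balance 1000 on Jan 1-2, then 900 on Jan 3-4
```

The real model spreads this logic across ~40 chained SQL queries, but the shape of the output - loan × day × position - is the same idea.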

In terms of size, two more of our core models are:

  1. ledger_entries which has one row for every money movement ever made and has > 20 billion rows.
  2. user_stats which has one row per user per day since they signed up (so if you signed up 1 year ago you’d have 365 rows in this table) and information about that user on that day. This has > 4 billion rows and >1,000 columns.
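The row-count arithmetic behind a one-row-per-user-per-day table like user_stats is easy to sketch - this hypothetical helper just shows where the 365-rows-per-year figure comes from:

```python
from datetime import date, timedelta

def user_stats_rows(signup_dates: dict, as_of: date):
    """One (user_id, date) row per user per day since signup, inclusive."""
    rows = []
    for user_id, signed_up in signup_dates.items():
        day = signed_up
        while day <= as_of:
            rows.append((user_id, day))
            day += timedelta(days=1)
    return rows

rows = user_stats_rows({"u1": date(2021, 1, 1)}, date(2021, 1, 10))
len(rows)  # 10 days inclusive -> 10 rows
```

Multiply that by millions of users and years of history and you can see how the real table gets past 4 billion rows.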

I really like this video about what big data is - TL;DR: it’s anything that can’t be done on just one computer. BigQuery is able to handle this kind of data because Google just throws more and more computers at the problem.

In total I think we now have > 4,000 data models (individual SQL scripts), and they all run every morning between midnight and ~10am, so you have updated data for the previous day available every morning. What’s more, if you want to make a change or add a column, that will have run and backfilled itself by tomorrow.
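The scheduling dbt does for us is essentially a topological sort of the model dependency graph - each model runs only after the models it selects from have finished. Here’s a toy version with a made-up dependency graph, using Python’s standard library:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency graph: each model maps to the models it selects from.
deps = {
    "loan_created_event": set(),
    "loan_payment_made_event": set(),
    "loan_events_unioned": {"loan_created_event", "loan_payment_made_event"},
    "loan_stats": {"loan_events_unioned"},
}

# static_order() yields a valid run order: dependencies always come first.
run_order = list(TopologicalSorter(deps).static_order())
# raw event models first, loan_stats last
```

With > 4,000 models this graph is huge, but the principle is the same - dbt works out the order (and what can run in parallel) from the dependencies declared in each SQL script.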

So we now have nice, easy-to-use datasets that aggregate everything up. We then use Looker as our data visualisation tool, which most people in the company have access to, so they can perform any analysis they want from the datasets the data team produce. We also use ad hoc SQL queries for ad hoc analysis, and Google Sheets for some things where Looker isn’t so great (more complex tasks usually). For building statistical models we mainly use Python - the ML team will spend ~20% of their time here, but I don’t work with models too much usually.

In terms of my time, I’d say I spend ~20% on the data model piece, ~15% doing analysis in Looker, Google Sheets or SQL, ~15% writing proposals (based on the analysis), and then ~50% in meetings or discussing product changes. Building products really is a team effort which everyone (data, design, user research, engineering, etc.) feeds into. These tend to come in phases, so one week I might only be doing data models and the next I might be writing a proposal.

Sorry if that was too detailed - I should probably do a talk on this stuff once things open up!
