How we scaled our data team from 1 to 30 people (part 1)

In the first of a series of posts, our VP of Data @dimitri is sharing an insight into how we’ve been scaling our data team at Monzo

PS we’re hiring!

Every backend engineer at Monzo is in part also a data engineer. Whenever somebody introduces a new backend service, they’re responsible for emitting so-called “analytics events” (logs), which are loaded in real time into BigQuery (there’s a bit more magic that happens here, like sanitisation, but that’s a story for another post!). This means we don’t need dedicated data engineers to create data collection pipelines.
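
For a rough idea of what this looks like, here’s a minimal sketch of an analytics event. The field names and event type are made up for illustration, not Monzo’s actual schema.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical analytics event: field names and event type are illustrative
# only. Each event is a small JSON payload describing something that happened
# in a backend service.
event = {
    "event_id": str(uuid.uuid4()),
    "type": "account.created",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "payload": {"account_id": "acc_00001", "currency": "GBP"},
}

# The backend emits this as a log line; a streaming pipeline then loads it
# into BigQuery in (near) real time.
print(json.dumps(event))
```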

I’d love to hear more about how this works in practice :smile:

E.g. How do you identify which events are important to track? How do you keep the formatting of the analytics events consistent? Etc.

As the post mentions, there’s a ton going on behind the scenes here - but I’ll point out two quick things.

How do you keep the formatting of the analytics events consistent?

All analytics events are just JSON payloads, which means we don’t have to worry about different columns in the raw data - we parse the fields we need out of the payload (and many of them are common to all analytics events). This also means that the first queries in our analytics pipelines pull the fields we care about out of those payloads.
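
To make that concrete, here’s a rough sketch of what such a first-step query might look like; the project, dataset, table and field names are all assumptions for illustration, not our actual schema.

```python
from google.cloud import bigquery

# Illustrative only: project, dataset, table and field names are assumptions.
# The first step of a pipeline is typically a query that pulls the fields we
# care about out of the raw JSON payload.
sql = """
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.account_id') AS account_id,
  JSON_EXTRACT_SCALAR(payload, '$.currency')   AS currency,
  TIMESTAMP(JSON_EXTRACT_SCALAR(payload, '$.timestamp')) AS event_time
FROM `my-project.raw_events.account_created`
"""

client = bigquery.Client()           # uses application default credentials
for row in client.query(sql).result():
    print(row.account_id, row.currency, row.event_time)
```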

How do you identify which events are important to track?

A simple rule of thumb that we use is that we track anything that is changing state in the backend. So when a thing gets created, updated, changed, etc., it’s likely that an event will be logged. The nice thing about this is that many of these events are named in a fairly intuitive way, so Data Scientists can often (not always!) find events that they need by searching the available tables in BigQuery.
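
As a hypothetical example of that search, a data scientist might simply look for tables whose names match the state change they’re interested in; the dataset name and naming convention below are assumptions, not our actual setup.

```python
from google.cloud import bigquery

# Hypothetical: dataset name and table naming convention are assumptions.
client = bigquery.Client()
sql = """
SELECT table_name
FROM `my-project.raw_events.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '%card_frozen%'
ORDER BY table_name
"""
for row in client.query(sql).result():
    print(row.table_name)
```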

Great article, thanks for all the insights. I’ve seen the ETL-everything-up-front-into-BigQuery approach before, and I wonder why you chose this over Kafka, which seems to be where this idea originated?

@dimitri Really great article. I’m intrigued to know how your Data Scientists have taken to the change in workflow, moving away from working in Jupyter Notebooks or their Python IDE of choice?

I manage Data Scientists and trying to ‘force’ a different way of working on them would likely be met by a lot of backlash.

Hi @duncang, I’m not a deep expert on Kafka, but here are my high-level thoughts.

  • the main idea is that our data models (the curated tables we produce from logs/analytics events) evolve pretty quickly: we add new dimensions to existing models daily, or even intraday, and want to backfill those models historically with the new dimensions. This means we need the ability to re-process a lot of historical data efficiently, in minutes or seconds (there’s a sketch of this pattern after the list). I guess Kafka could do this as well, but I’m not sure whether it would be as fast and reliable as BigQuery, which is optimised for batch processing (I’ve never tested it myself).
  • if we were to use Kafka for this transformation and BigQuery as the analytics database, re-processing all the historical data frequently would be a lot more difficult, as it would require us to move a lot of data between the two.
  • (probably the most important reason) because ETL is done by analysts, we want to keep our setup as simple as possible, i.e. analysts should be able to be productive with standard SQL alone. There’s no need to know how to optimise queries, how to scale resources, or any other programming languages.
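
To illustrate the pattern from the first bullet, here’s a minimal sketch of rebuilding a model with a newly added dimension. The table and column names are made up; this is a simplified stand-in for an actual model, not our real pipeline.

```python
from google.cloud import bigquery

# Minimal sketch of the "rebuild the whole model" pattern: table and column
# names are made up. Adding a new dimension and backfilling it across all
# history is a single statement in standard SQL, run as one batch job.
sql = """
CREATE OR REPLACE TABLE `my-project.models.daily_signups` AS
SELECT
  DATE(TIMESTAMP(JSON_EXTRACT_SCALAR(payload, '$.timestamp'))) AS signup_date,
  JSON_EXTRACT_SCALAR(payload, '$.channel') AS channel,  -- newly added dimension
  COUNT(*) AS signups
FROM `my-project.raw_events.account_created`
GROUP BY signup_date, channel
"""

bigquery.Client().query(sql).result()  # re-processes the full history in one run
```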

If you have implemented a similar workflow with Kafka, I’d be curious to hear how it went and what the limitations are.

Hi @jcbennett, as you have pointed out, forcing an approach onto the analysts is unlikely to work.
The answer will depend on the size and maturity of your team. If the team is already established, the first thing I would do is reason from first principles, together with the team, about how you can be more effective as a group. Hopefully it won’t be too difficult to arrive together at the conclusion that individual ad-hoc queries and one-off charts in notebooks optimise for the individual’s speed rather than the company’s.

Once people agree “in theory” that you should do things differently, it’s about embedding the new way of working into your team’s culture. You can do this by introducing principles and values for how the team works (defined together with the team, of course). Once the right behaviours are captured in those principles, tie progression and performance evaluation to them. Publicly recognise people who do a good job of optimising for collective speed and doing things the “good way”, and make sure you have at least one other big proponent of the new way who acts as a role model (your “change agent”). It’s also important to emphasise constantly that the company comes first, then the team, and then individuals.

Lastly, the onboarding period, with clear expectations and goals, is the best instrument for bringing people directly into the right behaviours, rather than trying to change them once someone is already a year in or so :wink:

Btw, we also still use notebooks, but the key is to use them only for things that you can’t do in your BI tool of choice, e.g. because you need a more complicated algorithm or computation, or something like that.
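
For example (table name assumed, purely for illustration), a notebook would typically start from an already-curated model rather than the raw events, and only then do the heavier work the BI tool can’t express.

```python
from google.cloud import bigquery

# Assumed table name, purely for illustration. The notebook reads from an
# already-curated model, then does whatever the BI tool can't express
# (e.g. fitting a model with scikit-learn or running a forecast).
client = bigquery.Client()
df = client.query(
    "SELECT signup_date, channel, signups FROM `my-project.models.daily_signups`"
).to_dataframe()  # requires pandas (and db-dtypes) to be installed

print(df.describe())
```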

Hope the answer helps a little :slight_smile: Good luck.

That is one amazing post and takes some serious re-reading.

Massive motivational respect for this :+1:

@dimitri , great post – I blasted my whole team with it.

One question: what about when a breaking schema change comes from leadership? Say, for example, a company’s US Sales Territories change in 2018 from [‘East’, ‘West’] to [‘Northwest’, ‘Southwest’, ‘Northeast’, ‘Southeast’].
Pretend your team needs to build an ML model using historical sales data from 2017 to present, with Territory as a feature.

  1. Would this “data model” be:
    1. created with SQL, and
    2. built with pre-2018 Territory values mapped to post-2018 values, to remove the signal associated with time from the Territory feature? (Something like the sketch after this list.)
  2. Is this a common occurrence for your team? What workflow do you follow when it happens?
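
For concreteness, here’s the kind of mapping I mean; it’s purely illustrative, with made-up table and column names. Since the old coarse values can’t be split deterministically, this sketch maps the finer post-2018 values back onto the pre-2018 ones, which is one way to get a taxonomy that’s consistent over time.

```python
from google.cloud import bigquery

# Purely illustrative: table and column names are made up. The CASE folds the
# post-2018 territories back onto the coarser pre-2018 taxonomy so the
# Territory feature is consistent across the whole 2017-present history.
sql = """
CREATE OR REPLACE TABLE `my-project.models.sales_consistent_territory` AS
SELECT
  sale_id,
  sale_date,
  CASE territory
    WHEN 'Northwest' THEN 'West'
    WHEN 'Southwest' THEN 'West'
    WHEN 'Northeast' THEN 'East'
    WHEN 'Southeast' THEN 'East'
    ELSE territory   -- pre-2018 rows already use 'East' / 'West'
  END AS territory,
  amount
FROM `my-project.raw.sales`
"""

bigquery.Client().query(sql).result()
```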

This is an ongoing challenge for us: the ML team is the only customer that needs historical views, so it ends up spending a lot of cycles updating historical-view data models to reflect taxonomy changes as they come in.

Any advice on this would be helpful!

This is a great article, very insightful. I’ve got 2 questions:

  1. How are you ensuring the cleanliness of the data? Is there some sort of validation process, a quarantine for possibly dirty data, etc.?

  2. Are you storing your transformed data in a separate BigQuery table, or in the same one as the original events? I’ve heard of both approaches being used.

Thanks
Sam