Have you ever opened your app at exactly 4pm the day before payday to use our Get Paid Early feature?
Then you've potentially been part of an effect called the stampeding herd. This is a term we use when large numbers of our customers open the app within a very short time period.
I'm Jacob, one of Monzo's Staff Engineers. I recently wrote about my path to Staff here. I spend most of my time working on our Borrowing products like loans and overdrafts, but recently I tackled the problem of making our app more resilient to these spikes in app opens.
This blog explores how I built new capabilities to reduce load on our platform before we get overwhelmed, so you can still access and use the most critical parts of the app, and your card continues to work as normal.
Looking forward to picking up any questions that might pop up!
Really interesting article - I am not a tech person per se so might not understand it all fully, but am I right in interpreting it as:
When Monzo gets busy you turn on a feature which temporarily degrades (not noticeably) the level of service to account for demand. E.g. slower animations, ancillary services loading to protect core functions?
Question: How does this tie in with scaling, i.e. you see a request for +10k more users than normal and dynamically scale the service to accommodate it?
This seems like it doesn't scale, and just uses existing resource, albeit with an impact (not visible to the customer's eye).
By default we will scale our services to meet additional demand that comes in. 99% of the time this works really well and everything keeps running as normal.
However, occasionally an increase in load can be very sudden (or just very large) and we might not be able to scale quickly enough to meet the demand without some really important things slowing down.
That's when this load shedding comes in - we can prioritise the most important services in return for a barely noticeable change in performance for less important features. We can use this to give us more time to scale up to meet demand.
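For readers who like to see ideas in code, here is a minimal sketch of priority-based load shedding. It only illustrates the general idea described above and is not Monzo's implementation; the `Priority` tiers, the `LoadShedder` class and its `lowest_served` setting are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical importance tiers; a lower value means more important."""
    CRITICAL = 0   # e.g. card payments
    IMPORTANT = 1  # e.g. viewing your balance
    OPTIONAL = 2   # e.g. ancillary features


@dataclass
class LoadShedder:
    # Least important tier that is still served.
    # OPTIONAL means nothing is shed (normal operation).
    lowest_served: Priority = Priority.OPTIONAL

    def allow(self, priority: Priority) -> bool:
        """Return True if a request of this priority should be served."""
        return priority <= self.lowest_served


shedder = LoadShedder()
normal = [shedder.allow(p) for p in Priority]

# Under a sudden spike, stop serving the least important tier
# to buy time while the platform scales up.
shedder.lowest_served = Priority.IMPORTANT
under_load = [shedder.allow(p) for p in Priority]
```

In this sketch, raising the shedding level only ever affects the less important tiers, so critical requests are served in both modes.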
I'd have thought that by now, given the Get Paid Early feature has been running for years, the daily 4pm load vs the number of active users is easy to systematically predict?
There must be some big data available to get this almost right (to +/- x% of load).
Firstly, thanks for taking the time to reply, much appreciated. So this is a bit like a "red button" if you see it won't scale in time.
I'm curious about why this needs to exist though. I would've thought you would have previous trends to know when this is likely to happen and when to scale in advance to a tolerant level.
I don't work in cloud engineering so this isn't an expert view, but I'm still a little curious why this needs to exist.
Is this more for cases like an attack, where it prevents a total shutdown, rather than for Get Paid Early?
You're right! And we have a blog on how we do essentially that here.
Perhaps Get Paid Early isn't a perfect example of when the feature my blog is about is most helpful: we've known about that problem for long enough to have more proactive scaling. However, sometimes the reason for a stampeding herd is more surprising than that. For example, if we were having an incident that led lots of people to open the app to contact support, then without the ability to handle a large, unexpected surge in load, all those people opening the app might make the problem worse.
We decided to make this investment in protecting critical services when lots of people open the app after reviewing some of our worst historic incidents and deciding that a tool like this could have helped us to recover faster, and with less customer impact. At one point in time Get Paid Early was new and was the cause of some of these incidents, so that's why the example still springs to mind.
Thanks for the clarification. I wondered how the recent (community) post differed from the previous (web) blog post, but the clarification explains it isn't just Get Paid Early - it's ALL load (amplified by Get Paid Early if the timing is wrong).
I think I covered this a bit in my reply to David above - but in short we shouldn't need this for most things that we can predict, but what about the things we can't?
Another consideration is that some of the policies I developed have quite a meaningful impact on load for almost no customer impact at all (e.g. not calculating lots of expensive things when your phone receives a push notification). These have so little customer impact that we have even considered having them on all the time. We've decided not to do that for now, but if we were planning something that would lead lots of people to open the app at once, then we might choose to turn on some of these policies as well as scaling - certain resources within our platform are hard to scale up dramatically just for a short-term spike in load.
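To make the idea of switchable policies a little more concrete, here is a rough sketch of a policy that skips expensive work when a push notification arrives. Everything in it is an assumption for illustration: the `SheddingPolicies` class, the policy name `skip_expensive_work_on_push` and the `handle_push_notification` function are hypothetical and do not describe how Monzo's platform is actually built.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class SheddingPolicies:
    """Hypothetical registry of load-shedding policies that can be toggled."""
    enabled: Set[str] = field(default_factory=set)

    def enable(self, name: str) -> None:
        self.enabled.add(name)

    def disable(self, name: str) -> None:
        self.enabled.discard(name)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled


def handle_push_notification(
    policies: SheddingPolicies,
    compute_expensive: Callable[[], str],
) -> Optional[str]:
    """Run the expensive computation unless the skip policy is switched on."""
    if policies.is_enabled("skip_expensive_work_on_push"):
        return None  # skip the work; the app can fetch it later if opened
    return compute_expensive()


calls = []


def expensive() -> str:
    calls.append(1)  # record that the expensive work actually ran
    return "refreshed"


policies = SheddingPolicies()
before = handle_push_notification(policies, expensive)

# Turn the policy on ahead of an expected surge, alongside scaling.
policies.enable("skip_expensive_work_on_push")
during = handle_push_notification(policies, expensive)
```

Because the policy is just a switch, it could be turned on ahead of a planned event and turned off again afterwards without any change to the code that handles notifications.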