First up, let me apologise for the inconvenience this bug caused. It’s not acceptable from us, and every incident prompts our engineering team to look back and put additional measures in place to prevent it happening again.
Something we don’t mention too often is how we deal with regressions and testing around releases. We actually have a few different testing and QA phases on the iOS product team, which is where this defect originated.
During development, our continuous integration platform runs a suite of unit tests designed to test isolated components of our iOS app individually. For example, they check that our model objects can correctly deserialise from a JSON object our backend service provides, or that our mobile number formatter correctly formats mobile numbers. We enforce good testing through our peer review process, so that all new functional code that can be tested is tested in a useful way. We have a few hundred of these tests currently.
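To give a flavour of what tests like these check, here is a minimal sketch in Swift. The `Transaction` model, its field names and the `formatMobileNumber` helper are hypothetical names made up for illustration and are not taken from the real app. Plain `assert` calls stand in for a proper test framework so the snippet runs on its own.

```swift
import Foundation

// Hypothetical model: the field names are illustrative, not the real backend schema.
struct Transaction: Decodable, Equatable {
    let id: String
    let amount: Int      // minor units, e.g. pence
    let currency: String
}

// Hypothetical formatter: groups an 11-digit number as "07700 900123".
// Anything that is not exactly 11 digits is returned unchanged.
func formatMobileNumber(_ raw: String) -> String {
    let digits = raw.filter { ("0"..."9").contains($0) }
    guard digits.count == 11 else { return raw }
    return "\(digits.prefix(5)) \(digits.suffix(6))"
}

// Pre-canned JSON standing in for a backend response.
let json = Data(#"{"id": "tx_001", "amount": -350, "currency": "GBP"}"#.utf8)

// The model should deserialise into exactly the value we expect.
let decoded = try JSONDecoder().decode(Transaction.self, from: json)
assert(decoded == Transaction(id: "tx_001", amount: -350, currency: "GBP"))

// The formatter should group the digits consistently.
assert(formatMobileNumber("07700900123") == "07700 900123")
```

Each check covers one small component in isolation, which is what keeps a suite of a few hundred of them quick to run on every change.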
We test all our different API service integrations against pre-canned or “stubbed” data. This ensures that we can quickly verify that the app is handling responses (or errors) and is structuring its various network requests correctly. This has saved our bacon a few times, especially during development against new services (this request should be a PUT not a POST, etc).
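A stubbed integration test of this kind might look roughly like the sketch below. Everything here (`Transport`, `StubTransport`, `ProfileClient`, the `/profile` path) is an assumed, simplified stand-in rather than the app's real networking layer. The idea is only to show how a stub can record the outgoing request and return a canned response.

```swift
import Foundation

enum HTTPMethod: String { case get = "GET", post = "POST", put = "PUT" }

// Simplified request/response types (assumptions for this sketch).
struct APIRequest: Equatable {
    let method: HTTPMethod
    let path: String
    let body: [String: String]
}

struct APIResponse {
    let status: Int
    let body: Data
}

protocol Transport {
    func send(_ request: APIRequest) -> APIResponse
}

// Records every request and answers with whatever canned response it holds.
final class StubTransport: Transport {
    private(set) var sent: [APIRequest] = []
    var canned: APIResponse

    init(canned: APIResponse) { self.canned = canned }

    func send(_ request: APIRequest) -> APIResponse {
        sent.append(request)
        return canned
    }
}

enum APIError: Error, Equatable { case server(status: Int) }

// Hypothetical client under test.
struct ProfileClient {
    let transport: Transport

    func updateName(_ name: String) throws {
        let request = APIRequest(method: .put, path: "/profile", body: ["name": name])
        let response = transport.send(request)
        guard (200..<300).contains(response.status) else {
            throw APIError.server(status: response.status)
        }
    }
}

// 1. The request is structured correctly (a PUT, not a POST).
let stub = StubTransport(canned: APIResponse(status: 200, body: Data()))
try ProfileClient(transport: stub).updateName("Sam")
assert(stub.sent == [APIRequest(method: .put, path: "/profile", body: ["name": "Sam"])])

// 2. A canned server error is surfaced rather than swallowed.
stub.canned = APIResponse(status: 500, body: Data())
do {
    try ProfileClient(transport: stub).updateName("Sam")
    assertionFailure("expected a server error")
} catch {
    assert((error as? APIError) == APIError.server(status: 500))
}
```

Because the stub never touches the network, both the happy path and the error path can be checked quickly and deterministically.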
As we build our user interfaces, we record snapshots and write tests that compare against them during continuous integration and deployment. Any screen that doesn’t match its pre-canned image causes a failure, so that build isn’t eligible to be deployed.
We currently have a suite of tests which run a simulated version of the app, tap some buttons, and check the result of each action. Sadly this area is a little flaky at the moment, so these tests aren’t running automatically and we have to manually verify that they pass before a deployment, but we’re working hard to get them into our automatic test pipeline. @jgarnham wrote an excellent blog post on this here: https://monzo.com/blog/2016/04/26/automated-testing/
Smoke / regression tests
As a ‘last line of defence’, before a release candidate build goes out to our TestFlight beta testing channel, we perform a smoke test on the build. This is performed by a human, who runs through a pre-written script of scenarios to ensure they all work as expected. It’s time-consuming, but ensures we don’t (usually) miss anything obvious that might have slipped through the nets of the previous tests.
After this, builds go out to our TestFlight beta test superstars, who kick the tyres and report back any oddities. We very rarely make changes between TestFlight and full releases, but if a beta build does ship with something “not quite working as expected”, we’ll continue to iterate with our beta testers before going to the App Store.
Sadly, in this case the bug was a little more difficult to spot, and it slipped through all of these nets. In certain scenarios, our app was not correctly reporting the APNS (Apple Push Notification Service) token back to our backend service.
This meant that certain users had approved push notification permissions, but we weren’t able to send messages to them.
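For readers unfamiliar with the token flow, the sketch below shows one way an app could make sure a token is always reported. It is not the actual fix, and the specific scenarios that triggered the bug aren't covered here. `TokenReporter` and its `upload` closure are hypothetical names. The assumed design is that the token is only marked as reported once the backend call succeeds, so a failed upload is retried the next time the system hands over the token.

```swift
import Foundation

// Hex-encode the raw token bytes, a common way to send a device token to a backend.
func hexString(from token: Data) -> String {
    token.map { String(format: "%02x", $0) }.joined()
}

final class TokenReporter {
    private(set) var lastReported: String?
    private let upload: (String) -> Bool   // hypothetical; returns true on success

    init(upload: @escaping (String) -> Bool) { self.upload = upload }

    // Call whenever the system delivers a device token.
    func didReceive(token: Data) {
        let hex = hexString(from: token)
        guard hex != lastReported else { return }   // already reported, nothing to do
        if upload(hex) { lastReported = hex }       // only remember successful uploads
    }
}

// Simulate an upload that fails while "offline" and succeeds afterwards.
var uploads: [String] = []
var online = false
let reporter = TokenReporter { hex in
    guard online else { return false }
    uploads.append(hex)
    return true
}

let token = Data([0x0a, 0xff, 0x10])
reporter.didReceive(token: token)   // offline: upload fails, token stays unreported
online = true
reporter.didReceive(token: token)   // retried and reported
reporter.didReceive(token: token)   // unchanged, so skipped
assert(uploads == ["0aff10"])
```

The property worth testing is the one that failed here: a user who has granted permission should always end up with a token the backend knows about.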
We were able to reproduce the issue reliably on Friday morning, and worked on a hotfix. This was then independently reviewed by two other engineers, and tested on a number of internal accounts on which we had reproduced the original issue. After running a smoke test to ensure we hadn’t introduced further regressions, and making sure our full unit test suite passed, we submitted the fix to the App Store and requested an expedited review in order to minimise the number of users affected. We were lucky enough to receive approval for this, and on Friday evening around 8:00pm the 1.8.1 patch release went live on the App Store.
As a result of this issue, we’ve added further steps to our smoke testing script to cover this specific case, and we’re investigating what more we can do on the automated testing side to beef this up too.
Sadly, no amount of testing can stop everything falling through the net, so we’re also making sure our ability to react to live issues, and the speed at which we do it, is second to none. That way we can minimise the impact of any issue before it becomes a real problem.
As always, happy to answer questions about our testing practices! If you read this whole thing and still want more, we’re currently hiring for product and testing roles. Get in touch via https://monzo.com/careers/ for more info.