The Problem: It's never the API gateway (until it is).
Your monitoring is on point, you have a symphony of alerts with appropriate priority levels, runbooks are written and up-to-date, and your services autoscale like an absolute mother hubbard.
But Johnny Stakeholder doesn't give a damn how sophisticated your stack is. Johnny Stakeholder is going to trigger an incident at 4am, with the only details being a blurry photo of an inscrutable 500 error. Yes, Johnny Stakeholder takes pictures of his screen with his phone. Johnny Stakeholder suggests you deal with it. Johnny Stakeholder is going on a cigarette break and when Johnny Stakeholder gets back he expects it to be fixed.
In this second half of our Incident Response two-parter: what should happen when the pager goes off? We dissect a typical incident (typical in our experience, at least). How do you organise an effective response? What steps should you take to understand the underlying issue? And what if you can't fix it in a reasonable time?