3 Steps to Improve Your Incident Response Capability
A robust incident response capability is key for any web application product team that takes its craft seriously. Whether an incident affects the stability, availability or reliability of the product, the speed and expertise with which it is resolved make all the difference in keeping your customers happy. Then there are other operational incidents, such as information governance or cybersecurity incidents, which you simply cannot afford to get wrong. In this article we’ll explore three steps you can take to improve your incident response capability.
Define a high-level procedure
The starting point for any incident management capability is defining an incident response process. If you don’t have an established process, my advice is to start small. Concentrate on defining the key parts of the journey: what counts as an incident, how incidents are reported, what the high-level flow of incident response is, and where and when stakeholders are kept up to date.
Here are some points to consider in your incident response procedure:
Which types of incidents are relevant to your organisation? Consider security, availability, information governance, clinical safety, etc.
What constitutes an incident that requires invoking the incident management procedure?
What are the criticality levels of incidents?
Which third parties are you likely to need to liaise with when responding to incidents? How do you contact them? For modern web applications, these can be your cloud provider and any other third parties that are critical to the functioning of your product, such as database, DNS and CDN providers.
Do you need to respond to incidents 24/7? If yes, do you have an on-call schedule?
What is the communication plan for stakeholders?
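As a minimal sketch of how the answers to these questions could be written down in a machine-readable form, the Python below models incident types and criticality levels. The four-level scale, the class names and the out-of-hours rule are assumptions made for illustration, not a standard; substitute whatever your own procedure defines.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class IncidentType(Enum):
    # The incident types mentioned in the checklist; extend as needed.
    SECURITY = "security"
    AVAILABILITY = "availability"
    INFORMATION_GOVERNANCE = "information governance"
    CLINICAL_SAFETY = "clinical safety"


class Criticality(IntEnum):
    # Hypothetical scale: a lower number means a more critical incident.
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


@dataclass(frozen=True)
class Incident:
    title: str
    type: IncidentType
    criticality: Criticality


def requires_out_of_hours_response(incident: Incident) -> bool:
    """Assumed rule: only the two most critical levels page the on-call responder."""
    return incident.criticality <= Criticality.HIGH


outage = Incident("Login page unavailable", IncidentType.AVAILABILITY, Criticality.CRITICAL)
typo = Incident("Typo in status page", IncidentType.AVAILABILITY, Criticality.LOW)
print(requires_out_of_hours_response(outage))
print(requires_out_of_hours_response(typo))
```

Writing the levels down as data, rather than leaving them in people's heads, makes it easier to keep alerting rules and the written procedure consistent with each other.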
Define roles, responsibilities and escalation paths
Once you know the rough outline of your incident response procedure, the next step is figuring out who in your organisation is supposed to do what. Here is a handy checklist to get you started:
Who in your organisation must respond to the different types of incidents you can encounter?
If you have an on-call schedule, who is in it?
In case of an incident, who manages communications with internal stakeholders and customers?
Who acts as the incident manager, and does the incident manager change when the incident is escalated?
Is there a triage function and who runs it?
When incidents need to be reported to third parties, e.g. certain customers or regulatory bodies, who sends the reports?
It’s worth creating very clear rules on escalation. In my experience, some incident responders get stuck trying to resolve an incident on their own, when an escalation would likely lead to a faster resolution. For example, consider setting up a rule to escalate all high-priority incidents to more senior stakeholders of your organisation if they are not resolved within a certain time period.
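A time-based escalation rule like this one can be expressed in a few lines. The sketch below is illustrative only: the priority names and the deadlines (30 minutes and one hour) are made-up values, and `should_escalate` is a hypothetical helper, not part of any particular incident management tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical deadlines: how long an incident may stay unresolved
# before it is escalated to more senior stakeholders.
ESCALATION_DEADLINES = {
    "critical": timedelta(minutes=30),
    "high": timedelta(hours=1),
}


def should_escalate(priority: str, opened_at: datetime, now: datetime,
                    resolved: bool = False) -> bool:
    """Return True if an unresolved incident has exceeded its escalation deadline."""
    deadline = ESCALATION_DEADLINES.get(priority)
    if resolved or deadline is None:
        # Resolved incidents, and priorities without a deadline, never escalate.
        return False
    return now - opened_at >= deadline


opened = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(should_escalate("high", opened, opened + timedelta(minutes=45)))
print(should_escalate("high", opened, opened + timedelta(minutes=75)))
```

Because the rule depends only on the clock and the incident's priority, the responder does not have to decide mid-incident whether asking for help is justified.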
Create playbooks and establish a way of updating them
A key part of creating a process, and a prerequisite for improving it, is documenting it. Documentation enables repeatability and consistency between runs and between team members.
If you don’t have an established way of documenting processes yet, start with a simple playbook describing a series of steps. Your incident management procedure should include playbooks to be followed for different roles in the incident response capability.
To be effective, each playbook should be something a single team member can complete, e.g. “This is what a Cloud Engineer should do in the event of an availability incident in our product”. Playbooks should reference other playbooks where necessary. It is also important to treat them as living documents: when circumstances change or improvement opportunities are identified, update the playbook and make sure your team has a way of accessing the latest version.
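If you keep playbooks in a structured form, looking up the latest version becomes trivial. The following is a rough sketch under assumed names: the `Playbook` fields, the integer version scheme and the example steps are all hypothetical, and a wiki or document repository with version history would serve the same purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    role: str                        # who follows it, e.g. "Cloud Engineer"
    incident_type: str               # e.g. "availability"
    version: int                     # assumed scheme: incremented on every revision
    steps: tuple[str, ...]           # ordered steps for a single team member
    see_also: tuple[str, ...] = ()   # names of other playbooks it references


def latest_playbook(playbooks, role, incident_type):
    """Return the highest-version playbook for a role and incident type, or None."""
    matches = [p for p in playbooks
               if p.role == role and p.incident_type == incident_type]
    return max(matches, key=lambda p: p.version, default=None)


library = [
    Playbook("Cloud Engineer", "availability", 1,
             ("Check provider status page", "Restart affected service")),
    Playbook("Cloud Engineer", "availability", 2,
             ("Check provider status page", "Fail over to standby", "Notify incident manager"),
             see_also=("Communications Lead: availability",)),
]

current = latest_playbook(library, "Cloud Engineer", "availability")
print(current.version, len(current.steps))
```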