Do you really want to have your developers on call 24/7?

Traditionally, many organisations have interpreted the DevOps ‘shared responsibilities’ principle to mean that the developers in your team should staff the 24/7 out-of-hours schedule in addition to their daytime duties. This approach does have its benefits: getting developers to feel the pain of running the product first-hand is a sure way to keep operability front and centre in sprint planning conversations. Plus, who is better positioned to deploy a hotfix overnight than your lead developer?

Unfortunately, the real world tends to be messier than that. Organisations that are slightly further along the maturity cycle often start seeing the disadvantages of this arrangement: there is more to being on call than responding to critical incidents. Even well-designed monitoring solutions and incident response models produce alerts that the recipient either cannot act on, or that merely require internal communications and third-party coordination.

Additionally, incidents tend to stack up and correlate with critical events in daytime delivery workstreams. This, combined with the fact that very few humans are capable of working 24/7, leads us to the main problem: if your senior developers are up all night running the incident management process, can you expect them to deliver their best work the next morning?

So what works better, then?

There is a better approach to being on call, one that works particularly well for modern, high-performing software delivery teams: it combines ‘feeling the pain’, prioritising operability when required, and capitalising on the strengths of your team without stretching them too thin.

In this model, there are dedicated DevOps & cloud engineers as members of the product delivery team. These engineers form an on-call schedule that acts as the first response for critical incidents, but with a clear escalation path to the rest of the development team. These engineers wear their operational hat more than anything else: they champion reliable and observable production infrastructure, monitoring and alerting, and Continuous Delivery. If there isn’t enough ongoing infrastructure and operability work in one team, they can be shared across multiple teams, as long as they have enough capacity to make a meaningful contribution to every team they are a member of, including taking part in its agile ceremonies.
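To make the shape of this model concrete, here is a minimal Python sketch of such a schedule. It is an illustration only, not how any particular paging tool works: the names `Shift`, `build_rota` and `route_alert`, the weekly rotation, and the `needs_code_fix` flag are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    week_start: date
    first_responder: str   # DevOps & cloud engineer on call for that week
    escalation: list       # developers to page only if the responder needs them


def build_rota(engineers, developers, start, weeks):
    """Rotate first response weekly across the engineers (assumed cadence).

    Developers never take a first-response shift; they only appear on the
    escalation path attached to every shift.
    """
    if not engineers:
        raise ValueError("at least one on-call engineer is required")
    return [
        Shift(
            week_start=start + timedelta(weeks=i),
            first_responder=engineers[i % len(engineers)],
            escalation=list(developers),
        )
        for i in range(weeks)
    ]


def route_alert(shift, needs_code_fix):
    """Return who gets paged, in order, for an alert during this shift."""
    pages = [shift.first_responder]
    if needs_code_fix:
        # Hypothetical escalation rule: only wake developers for code changes.
        pages.extend(shift.escalation)
    return pages
```

With three engineers and one developer on the escalation path, the first responder changes every week and the developer is only paged when the on-call engineer decides a code change is needed; alerts that call for communication or third-party coordination stop at the first responder.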

I’ve found that building on-call teams of 3-5 DevOps & cloud engineers, with a primary engineer assigned to each supported development team, provides a good return on investment while also balancing the risks involved in operating critical products.

At Releaseworks, we offer this model as a service. You can sign up for a free, no-obligation 30-day trial on our website: Releaseworks DevOps Rapid Response