What would you do if something went wrong?

Sometimes even the best-planned work can go wrong. Often you can identify the best-planned work not by its resistance to failure, but by its resilience and recovery in the face of it.

Reducing the likelihood of things going wrong is one part of planning for a change. Reducing the consequence if the worst happens is another. But one aspect of both that is often overlooked is the value of good process, which can reduce real impact by streamlining our general response to issues arising.

Having rollback plans is a good start, but process can even guide us when they fail. And it’s not all Change Management process. Sometimes other disciplines – such as Major Incident – might need to be invoked when things go awry. So here’s a quick overview of WHAT TO DO WHEN SOMETHING GOES WRONG.

The Rollback or Problem Management

If something goes wrong with a change, the first step is to consider whether to invoke the rollback plan, or its equivalent. This decision may not be completely automatic but should be triggered early and easily. The purpose of a rollback plan is to avoid trying to figure out what to do after something has gone wrong, so in most cases the answer is to take advantage of that plan without delay, and then take stock when the situation has stabilised.

Sometimes a problem that emerges after a release is spotted too late for a rollback, or categorically doesn’t need to be rolled back. If it can be lived with, or managed, you likely want to turn to Problem Management. This offers a way to track the issue, document what is understood about its root cause, and provide a workaround, as well as a springboard for getting it permanently fixed, should the opportunity arise.

However, if you can’t live with an issue that has arisen, and the rollback plan isn’t helping or can’t be used, then you might need to consider more significant action.
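
The choice between these three paths can be sketched as a small decision function. This is only an illustration of the reasoning above, not part of any real Change Management tooling: the names `NextStep`, `next_step`, `rollback_usable`, `rollback_needed` and `can_live_with_issue` are all hypothetical, and in practice each input is a human judgement rather than a flag.

```python
from enum import Enum


class NextStep(Enum):
    """Hypothetical labels for the three paths described above."""
    ROLLBACK = "invoke the rollback plan"
    PROBLEM_MANAGEMENT = "track the issue through Problem Management"
    ESCALATE = "consider more significant action"


def next_step(rollback_usable: bool,
              rollback_needed: bool,
              can_live_with_issue: bool) -> NextStep:
    """Pick a path for an issue that follows a change.

    rollback_usable: the rollback plan can still be run (not spotted too late).
    rollback_needed: the issue is one that warrants rolling back.
    can_live_with_issue: the issue can be tolerated or managed for now.
    """
    # The rollback plan comes first: use it without delay when it applies.
    if rollback_usable and rollback_needed:
        return NextStep.ROLLBACK
    # No rollback, but the issue is tolerable: track it and work around it.
    if can_live_with_issue:
        return NextStep.PROBLEM_MANAGEMENT
    # No rollback and the issue cannot be tolerated: escalate.
    return NextStep.ESCALATE


# An issue spotted too late to roll back, but which can be managed for now.
print(next_step(rollback_usable=False, rollback_needed=True,
                can_live_with_issue=True).value)
```

The ordering of the checks is the point of the sketch: the rollback plan is considered before anything else, and escalation is what remains when neither of the other paths fits.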

The Emergency Change Process

If for some reason your rollback plan doesn’t work, you might be looking to do something more drastic, such as some kind of full recovery of the service. In terms of process, this kind of extraordinary action is probably going to be described as an Emergency Change. Doing something is not always better than doing nothing, and when we’re off the beaten track it pays to have a wider understanding of what’s at stake, which is something an Emergency CAB can generally help with.

However, there may be a halfway house available. For example, in ISG we make some exceptions for normal service continuity plans, in that they can be enacted without outside reference to the emergency CAB. That’s all above board from a process point of view – although an Emergency Change record should be logged in this case, it can happen after the fact.

If you’re looking at making an Emergency Change to resolve ongoing issues, pay attention to managing communications! In our experience, more of the feedback we get after unplanned issues relates to how we communicate the situation than to how we resolve it. Sometimes it’s very positive feedback, sometimes… not. We can stand to learn from both!

Is it a Major Incident?

Is the issue being treated by the organisation as a Major Incident? If so, this stage can quickly overtake the others, as Major Incidents will usually be declared centrally, and not by the person running the change that caused them.

As a result, if a Major Incident has arisen as a result of planned work, you could be co-opted to a larger team of people managing it. (This is very likely if you’re someone who would usually be involved in the recovery of that service.)

Here’s where things can seem counterintuitive. The key thing with Major Incidents is that recovering services safely is often treated as more important than recovering them instantly. That’s not to diminish the urgency of the situation, but the risk of having to unpick a bad fix grows the further you get from normal operation, and unpicking one adds to delays rather than removing them.

With that in mind, this group is likely to weigh up a number of considerations before any further remedial work is carried out. By this point you’re probably in somewhat unknown territory, so there isn’t necessarily a “go to” action left on the table in any case. But while technically oriented staff often have strong (and usually correct) instincts about which course of action is most likely to yield the best results, the temptation to run off and try something needs to be avoided at this point!

Whatever path we take to resolving issues, having a sensible set of processes can help us at the time, but also after the fact. Not only does process tend to yield better contemporaneous logging and tracking, and provide both material and triggers for review, but it also helps to keep any reviews focussed on what matters: where can we tighten things up, to protect us for the next time?

(https://commons.wikimedia.org/wiki/File:Computer_on_fire.svg)