Picking up the Pieces of Adverse Releases

There’s a key distinction that’s often missed: the one between a failed Change and a failed Release (hereafter referred to as an Adverse Release).

It’s the difference between making a mistake about whether a Change will deliver value, and making a mistake in the execution of that Change.

Success or failure

Although the value proposition is important – fundamental, you might say – still, when it comes to centralised process we tend to focus on the quality of our Releases. There’s good reason for that, of course; poorly executed Releases will often directly disrupt our service provision.

The Go CAB, which we run weekly, is primarily a safety net to check the preparedness of Releases and, as far as possible, to ensure that scheduling conflicts, lack of planning, or misunderstanding of risks don’t contribute to problems for our services.

But all the work around Go CAB is before the fact, and at least some of the time Releases will still go awry. So the question remains: what do we do then? What is our remedy, after the fact, for an Adverse Release?

In the first instance we already know what we would do, and have planned for it; we’d execute the rollback. In a worst-case scenario, we might also need to respond immediately with a Major Incident. That process is ready to go at the drop of a hat, if we need to stand it up. However, most of the time when a Release goes wrong it really doesn’t reach the Major Incident threshold.

In practice, when Releases go wrong, we will often see a slightly extended outage period and/or the initiation of our rollback plan. But it’s worth noting that Adverse Releases don’t necessarily correspond to any change in the availability of services.

In many cases the need to roll back a Release may be identified, and acted on, even before there is any noticeable change in the availability or quality of a service. In some other cases a Release may generate an unexpected number of user queries, and be considered Adverse on that account – even if the execution was technically flawless. After all, an unexpected uptick in user queries can indicate a failure to communicate work optimally, which is another component of successfully executing Releases.

There are opportunities to understand and learn from these outcomes but, as you can see, they don’t all have the urgency or the impact to warrant an immediate and overwhelming response. Actually, Release Reviews tend to work better in practice if we put a little bit of time between the Adverse Release and the review.

So whenever one of the Adverse Release thresholds is tripped, the Release is formally marked Adverse, and we then schedule a review. These reviews tend to come around two weeks later, and last around 15-20 minutes. The meetings will typically be between the Change Manager and the relevant Service Owner, or someone acting on their behalf.
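To make the trigger concrete, here is a minimal sketch of how marking a Release Adverse and scheduling its review might look in Python. It assumes three thresholds drawn from the examples above (a rollback, an extended outage, and an unexpected number of user queries). The field names, the exact comparisons, and the fixed two-week delay are hypothetical illustrations, not our actual tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: reviews are booked roughly two weeks after the Release.
REVIEW_DELAY = timedelta(weeks=2)


@dataclass
class ReleaseOutcome:
    """Hypothetical record of how a Release went."""
    rolled_back: bool
    outage_overrun_minutes: int   # minutes beyond the planned outage window
    user_queries: int             # queries actually received
    expected_user_queries: int    # queries we planned for


def adverse_reasons(outcome: ReleaseOutcome) -> List[str]:
    """Return the (assumed) thresholds this Release tripped, if any."""
    reasons = []
    if outcome.rolled_back:
        reasons.append("rollback initiated")
    if outcome.outage_overrun_minutes > 0:
        reasons.append("extended outage")
    if outcome.user_queries > outcome.expected_user_queries:
        reasons.append("unexpected user queries")
    return reasons


def schedule_review(release_date: date, outcome: ReleaseOutcome) -> Optional[date]:
    """Return a review date if the Release is Adverse, otherwise None."""
    if not adverse_reasons(outcome):
        return None
    return release_date + REVIEW_DELAY


# Example: a technically flawless Release that still drew more queries than planned.
outcome = ReleaseOutcome(rolled_back=False, outage_overrun_minutes=0,
                         user_queries=40, expected_user_queries=10)
print(adverse_reasons(outcome))
print(schedule_review(date(2024, 1, 1), outcome))
```

Note that a Release can be Adverse for more than one reason at once, which is why the sketch collects reasons rather than returning at the first match.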

Our Change Management process is as important for learning after the fact as it is for planning before it. So, while folks often have some trepidation about Release Reviews in advance of attending their first, it quickly becomes obvious that the reviews are more about avoiding future issues than dwelling on past mistakes. (In fact, that’s an approach we take across all of our processes.)

At the meetings, there are three main things we try to establish.

What was the cause of the Adverse Release?
What will we do to avoid this kind of issue in future?
What improvements can we make to our Change Process?

People are often surprised to be asked about the process during Release Reviews, but processes should grow and change to meet our needs, and often those needs are more obvious after things have gone wrong than when they are going right.

So that’s how the other half of Release Management is done. Sure, we do what we can to make sure Releases are as safe as possible. But then, for those few Releases that do go awry, we have a path to review them and to avoid the same problem happening again.

As a final note, we review the reviews as well; the proportion of Releases which are Adverse should be reducing over time – that’s our best indicator from the data, as to whether we’re on the right path.
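As a rough illustration of that indicator, the sketch below computes the proportion of Adverse Releases per reporting period and checks whether it is falling. The period labels, the input shape, and the function names are assumptions made for the example, and the sample data is invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def adverse_rate_by_period(releases: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each period to the share of its Releases that were Adverse.

    `releases` is a hypothetical list of (period, is_adverse) pairs,
    where periods sort chronologically as strings (e.g. "2024-Q1").
    """
    totals = defaultdict(int)
    adverse = defaultdict(int)
    for period, is_adverse in releases:
        totals[period] += 1
        if is_adverse:
            adverse[period] += 1
    return {period: adverse[period] / totals[period] for period in sorted(totals)}


def is_reducing(rates: Dict[str, float]) -> bool:
    """True if the Adverse proportion never rises from one period to the next."""
    values = list(rates.values())
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


# Invented sample data: 2 of 4 Adverse in Q1, then 1 of 4 in Q2.
sample = [
    ("2024-Q1", True), ("2024-Q1", True), ("2024-Q1", False), ("2024-Q1", False),
    ("2024-Q2", True), ("2024-Q2", False), ("2024-Q2", False), ("2024-Q2", False),
]
rates = adverse_rate_by_period(sample)
print(rates, is_reducing(rates))
```

A strict period-on-period check like this is deliberately simple; with small numbers of Releases per period, a longer window or a rolling average would probably give a steadier signal.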

("Crossroads: Success or Failure" attributed to ccPixs.com )

("Whoops" original photo and the license)