[Action] Creating a Change Control process

mrz · October 13, 2014, 1:12am

Now that we have a couple Discourse instances with users expecting a reliable and stable infrastructure, I’d like us to start thinking about a Change Control process. We shouldn’t be making ad-hoc changes without communicating these changes to our end users.

I think having something like this will also help build trust and transparency.

It should be lightweight, nothing overly burdensome. At some point when we have a CI/CD system, many changes can skip this process.

Changes should have a bug
Change Control Team (“CCT”) should meet regularly to discuss change requests. Possibly via email or video.
CCT verifies work, schedules work, plans communication to various end users
The CCT isn’t necessarily the domain experts but helps to ensure quality
does the change make sense? is it needed?
does the change have a rollback plan? a testing plan?

https://wiki.mozilla.org/IT/ChangeControl may be an example.

ps. Might this be a good starting contribution pathway?

mrz · October 17, 2014, 5:48am

I want to bump this thread just a bit here with a proposal.

As we don’t have an overabundance of changes to review right now, I propose that the Change Control Team meeting merge with the existing weekly Triage meeting.

Anyone interested in helping draft what such a process looks like should join!

cso · November 22, 2014, 5:15pm

Did anything ever happen with this?

One thing I’ve observed a lot with the reading I’ve been doing, which I was talking to @majken about, is changes being made that people don’t know about…

In a commercial environment, such things would (potentially) be grounds for dismissal, particularly if they led to any downtime or fallout.

I’m not suggesting the same should apply here, but if the Community resources are going to be used to host websites like bugzilla.org (as is planned), then there’s not really much scope for making changes without people being aware of them.

If something did get drafted, could it be linked to in this thread?

mrz · November 22, 2014, 10:41pm

@cso - we don’t have any current draft (want to help draft one?) but in the interim are tagging bugs with “cct-review”. We review those bugs during triage.

Most certainly agree with your idea around Community resources being held to a high standard.

majken · November 23, 2014, 5:05pm

I bet we have a draft in an etherpad somewhere. @cso - when we’re both around remind me to find it for you. I don’t have time to do it now.

mrz · November 25, 2014, 6:03am

Thought some more on this after bug 1104428.

The essence of change control is communication and documentation (of the change).

Here’s a very abridged version of what we’re using at Lookout. I’d like to get some thoughts on this and perhaps some discussion at the next Team Meeting.

Types of changes:
As we start, I basically see two types of changes:

Standard Change: Any Change that is planned and follows the normal process for approvals. Standard changes go through CCT Review.
Emergency Change: A Change that has an immediate need due to some level of urgency around an issue (e.g. a Load Balancer failed and needs to be replaced in the next 4 hours). Emergency Changes require secondary approval. Approval is usually done by “management” but because of our structure it might make sense to get peer approval or escalate to me (via VictorOps).

Again, the essence is around communications. In the first case, the CCT would review, approve, schedule and communicate the change in advance. In the second case, a peer or senior member of the team has awareness and can assist in communications.
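
To make the two paths concrete, here is a rough sketch of how a change might be routed by type. The names (`ChangeType`, `approval_path`, `secondary_approver`) and the returned messages are made up for illustration; this is not an existing tool, just one way to read the rules above.

```python
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"
    EMERGENCY = "emergency"


def approval_path(change_type, secondary_approver=None):
    """Return the next step for a change, based on its type.

    Hypothetical helper: Standard Changes wait for CCT Review,
    Emergency Changes need a named secondary approver (a peer or
    an escalation contact) before work starts.
    """
    if change_type is ChangeType.STANDARD:
        return "queue for weekly CCT review"
    if secondary_approver:
        return f"proceed, approved by {secondary_approver}"
    return "blocked: emergency change needs secondary approval"
```

For example, `approval_path(ChangeType.EMERGENCY)` stays blocked until someone is named as the secondary approver, which is the point where communication gets a second set of eyes.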

Level of Risk:
Changes have different risk levels associated with them.

Low Risk: Little to no risk of any issues with the targeted change (near-zero impact on uptime)
Medium Risk: The change occurring has a risk of impacting production service uptime and should be scheduled as part of planned downtime maintenance.
High Risk: The change will or almost certainly will result in production impact and must occur in a planned downtime maintenance window.

Risk can be determined by a probability/impact matrix.
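
As a sketch only, a probability/impact matrix can be written as a small lookup table. The cell values below are my assumption of a typical 3x3 layout, not the actual matrix from the process described here, and the names `Level`, `RISK_MATRIX` and `risk_level` are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Assumed matrix: key is (probability of a problem, impact if it happens).
# The cell values are illustrative and should be agreed on by the CCT.
RISK_MATRIX = {
    (Level.LOW, Level.LOW): "Low",
    (Level.LOW, Level.MEDIUM): "Low",
    (Level.LOW, Level.HIGH): "Medium",
    (Level.MEDIUM, Level.LOW): "Low",
    (Level.MEDIUM, Level.MEDIUM): "Medium",
    (Level.MEDIUM, Level.HIGH): "High",
    (Level.HIGH, Level.LOW): "Medium",
    (Level.HIGH, Level.MEDIUM): "High",
    (Level.HIGH, Level.HIGH): "High",
}


def risk_level(probability, impact):
    """Look up the risk level for a given probability and impact."""
    return RISK_MATRIX[(probability, impact)]
```

With this assumed table, a change that is unlikely to go wrong but would take production down if it did, `risk_level(Level.LOW, Level.HIGH)`, comes out as Medium Risk.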

Change Requests:
Any Change should have the following documented steps:

Deployment Steps: The actual steps required when deploying a change.
Rollback Steps: The Roll Back steps to get back to the prior configuration or version.
Validation Tests: A Pre and Post set of tests to validate the service’s functionality.
Peer Nomination: A Peer is someone qualified to review the Change.
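
One possible shape for such a Change Request, sketched as a small record with a completeness check. The field names and the `missing_sections` helper are hypothetical, and `bug_id` is only there on the assumption that each change is tracked in a bug.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    bug_id: str  # assumed: the tracking bug for this change
    deployment_steps: List[str] = field(default_factory=list)
    rollback_steps: List[str] = field(default_factory=list)
    validation_tests: List[str] = field(default_factory=list)
    peer_reviewer: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "Deployment Steps": self.deployment_steps,
            "Rollback Steps": self.rollback_steps,
            "Validation Tests": self.validation_tests,
            "Peer Nomination": self.peer_reviewer,
        }
        return [name for name, value in required.items() if not value]
```

A request with only deployment steps filled in would report the other three sections as missing, so it could be sent back before it ever reaches review.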

Peer Reviewer:
The Peer Reviewer should review the Change for the following:

Is the Risk Level accurate?
Are pre and post Validation Steps accurate?
Do Deployment Steps capture actual steps for deployment?
Do Rollback Steps get the service back up and running if needed?
Are all other necessary details provided, requested date, expected results post change, etc.?

Peer Reviewer should either ACK the Change or ask for additional information.

Review Process:
Standard Changes are reviewed weekly. During review, the CCT:

reviews deployment/rollback steps
approves & schedules (and communicates)
(OR denies & provides reasons why and sends back to requestor for additional information)

What do you think about this process?

cso · November 25, 2014, 7:25am

That all seems very logical, I must admit, and is very similar to the processes used where I’ve worked.

I also wonder if each change ought to have a Bugzilla bug related to it where all the approvals etc. can happen for accountability.