Now that we have a couple of Discourse instances with users expecting a reliable and stable infrastructure, I’d like us to start thinking about a Change Control process. We shouldn’t be making ad-hoc changes without communicating them to our end users.
I think having something like this will also help build trust and transparency.
It should be lightweight, nothing overly burdensome. At some point when we have a CI/CD system, many changes can skip this process.
Every change should have a bug filed for it.
The Change Control Team (“CCT”) should meet regularly, possibly via email or video, to discuss change requests.
The CCT verifies work, schedules work, and plans communication to the various end users.
The CCT members aren’t necessarily the domain experts, but they help to ensure quality:
Does the change make sense? Is it needed?
Does the change have a rollback plan? A testing plan?
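The review questions above could be captured as a small pre-review checklist. The following is only a sketch: the `ChangeRequest` record, its field names, and the `review_gaps` helper are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed change (field names are illustrative)."""
    summary: str
    bug_id: Optional[str] = None   # the bug filed for this change
    justification: str = ""        # why the change makes sense and is needed
    rollback_plan: str = ""
    testing_plan: str = ""


def review_gaps(cr: ChangeRequest) -> List[str]:
    """Return the checklist items the request does not yet satisfy."""
    gaps = []
    if not cr.bug_id:
        gaps.append("no bug filed")
    if not cr.justification.strip():
        gaps.append("no justification")
    if not cr.rollback_plan.strip():
        gaps.append("no rollback plan")
    if not cr.testing_plan.strip():
        gaps.append("no testing plan")
    return gaps
```

An empty list from `review_gaps` would mean the request is ready for the CCT to discuss; anything else goes back to the requester first.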
I want to bump this thread just a bit here with a proposal.
As we don’t have an overabundance of changes to review right now, I propose that the Change Control Team meeting be merged with the existing weekly Triage meeting.
Anyone interested in helping draft what such a process looks like should join!
One thing I’ve observed a lot with the reading I’ve been doing, which I was talking to @majken about, is changes being made that people don’t know about…
In a commercial environment, such things would (potentially) be grounds for dismissal, particularly if they led to any downtime or fallout.
I’m not suggesting the same should apply here, but if the Community resources are going to be used to host websites like bugzilla.org (as is planned), then there’s not really much scope for making changes without people being aware of them.
If something did get drafted, could it be linked to in this thread?
@cso - we don’t have any current draft (want to help draft one?) but in the interim we are tagging bugs with “cct-review”. We review those bugs during triage.
Most certainly agree with your idea around Community resources being held to a high standard.
The essence of change control is communication and documentation (of the change).
Here’s a very abridged version of what we’re using at Lookout. I’d like to get some thoughts on this and perhaps some discussion at the next Team Meeting.
Types of changes:
As we start, I basically see two types of changes:
Standard Change: Any Change that is planned and follows the normal process for approvals. Standard changes go through CCT Review.
Emergency Change: A Change that has an immediate need due to some level of urgency around an issue (e.g. a load balancer has failed and needs to be replaced within the next 4 hours). Emergency Changes require secondary approval. Approval is usually done by “management”, but because of our structure it might make sense to get peer approval or escalate to me (via VictorOps).
Again, the essence is around communications. In the first case, the CCT would review, approve, schedule and communicate the change in advance. In the second case, a peer or senior member of the team has awareness and can assist in communications.
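To make the two paths concrete, here is a minimal sketch of how a change type might map to its approval and communication steps. The `ChangeType` enum, the `handling` function, and the wording of the returned strings are assumptions made for illustration only.

```python
from enum import Enum
from typing import Dict


class ChangeType(Enum):
    STANDARD = "standard"
    EMERGENCY = "emergency"


def handling(change_type: ChangeType) -> Dict[str, str]:
    """Describe who approves a change and how it gets communicated."""
    if change_type is ChangeType.STANDARD:
        return {
            "approval": "CCT review",
            "communication": "CCT communicates the change in advance",
        }
    # Emergency: no time for a full review, so a second person must be aware.
    return {
        "approval": "secondary approval (peer or escalation)",
        "communication": "peer or senior team member assists",
    }
```

Either way, at least one person besides the one making the change knows about it before it happens.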
Level of Risk:
Changes have different risk levels associated with them.
Low Risk: Little to no risk of any issues with the targeted change (near-zero impact on uptime)
Medium Risk: The change has a risk of impacting production service uptime and should be scheduled as part of planned downtime maintenance.
High Risk: The change will or almost certainly will result in production impact and must occur in a planned downtime maintenance window.
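As a rough sketch, the three levels could translate into a scheduling rule like the one below. The `Risk` enum and `may_proceed` helper are hypothetical, and treating Low Risk changes as not needing a maintenance window is an assumption, since the definitions above only say their impact on uptime is near zero.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# How strongly each level calls for a planned downtime maintenance window.
# "not required" for LOW is an assumption, not stated in the definitions.
MAINTENANCE_WINDOW = {
    Risk.LOW: "not required",
    Risk.MEDIUM: "recommended",  # "should be scheduled"
    Risk.HIGH: "required",       # "must occur in a planned window"
}


def may_proceed(risk: Risk, in_maintenance_window: bool) -> bool:
    """Only High Risk changes are hard-blocked outside a maintenance window."""
    if MAINTENANCE_WINDOW[risk] == "required":
        return in_maintenance_window
    return True
```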
Risk can be determined by a probability/impact matrix: