Preventing downtime

Today HP Cloud scheduled a nine-hour maintenance window for our AZ. Our servers all lost network connectivity, and were offline for a while. This shouldn’t happen. We should come up with a plan to make sure it doesn’t.

Here’s my idea:

  1. At least two instances for everything, in different AZs. phab-web1 and phab-web2 shouldn’t live in the same region. Mrz told me that HP’s LBaaS registers backends by IP address, not by instance the way ELB does on AWS.

  2. Distributed monitoring. This is something I’ll try and work on soon.
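
As a rough sketch of what distributed monitoring could look like: several probes in different places each check a host, and the host is only declared down when a majority of them agree. That way one probe losing its own network doesn't page anyone. This is not an existing tool; the probe names and the `tcp_check` and `is_down` helpers are made up for illustration.

```python
import socket

def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def is_down(reports, quorum=None):
    """Decide whether a host is down from several probes' results.

    reports maps a probe name (hypothetical, e.g. "probe-aws") to True if
    that probe reached the host. By default a strict majority of probes
    must have failed before the host counts as down.
    """
    failures = sum(1 for reachable in reports.values() if not reachable)
    needed = quorum if quorum is not None else len(reports) // 2 + 1
    return failures >= needed

# One probe failing is treated as that probe's problem, not the host's.
print(is_down({"probe-az1": False, "probe-az2": True, "probe-aws": True}))    # False
# Every probe failing means the host really is unreachable.
print(is_down({"probe-az1": False, "probe-az2": False, "probe-aws": False}))  # True
```

Each probe would fill in its own entry with something like `tcp_check("phab-web1", 80)` and report it to wherever the decision is made. Keeping at least one probe outside HP Cloud matters, since a provider-wide outage would otherwise take the monitoring down along with everything else.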

I’d like to get some additional input on this, as well.

Which AZ was down and which hosts were affected? We do try to mitigate this as much as possible by using different AZs. The issue we may face is that we have limited instances, IP addresses, space, etc., so we would have to be careful when creating multiple copies across AZs.

Phab is actually spread across all three AZs: phab-web1 in AZ1, phab-web2 in AZ2 and phab-db1 in AZ3. So unless AZ3 was under maintenance, Phab would have worked without issue.
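
To make the "unless AZ3" caveat concrete, here is a small sketch that lists which roles live in only one AZ. The placement is the one described in this post; the `role_of` and `single_az_roles` helpers, and the convention of dropping the trailing number from a hostname to get its role, are assumptions for illustration.

```python
import re
from collections import defaultdict

# Host placement as described in this thread.
PLACEMENT = {
    "phab-web1": "AZ1",
    "phab-web2": "AZ2",
    "phab-db1": "AZ3",
}

def role_of(host):
    """Strip the trailing instance number: 'phab-web1' -> 'phab-web'."""
    return re.sub(r"\d+$", "", host)

def single_az_roles(placement):
    """Return roles whose instances all sit in one AZ, mapped to that AZ."""
    zones = defaultdict(set)
    for host, az in placement.items():
        zones[role_of(host)].add(az)
    return {role: next(iter(azs)) for role, azs in zones.items() if len(azs) == 1}

print(single_az_roles(PLACEMENT))  # {'phab-db': 'AZ3'}
```

The web tier survives losing either AZ1 or AZ2, but the database only exists in AZ3, so that is the one AZ whose loss takes Phab down.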

I looked up the maintenance notice, and here is the link: https://community.hpcloud.com/status/maintenance/2469

The maintenance was across a whole region rather than a single Availability Zone. The way HP Cloud works is that we use two regions (East and West), which are further subdivided into Availability Zones (AZ1, AZ2, AZ3). I looked at distributing our hosts between regions, but we have a couple of issues:

  1. Latency across regions (different datacentres) if we were to keep data in sync, monitor, etc.
  2. Internal IP addresses don’t work across regions. Each host in each region would need an external IP if they want to talk to each other. The problem with this is that we have limited external IP addresses, so we may use up our quota quickly.

It’s a good thing you brought this up, as we do need to think about maintenance. Unfortunately there will be times when we have to accept downtime. We just need to be able to give our users plenty of notice. For now we don’t have any external user services in HP Cloud; Discourse is in AWS. This gives us time to review what we have and what needs to be changed.

…and it’s down again. Their maintenance window ended three hours ago, so I’m not sure what’s going on.

Still down. Opened a bug on Phab but emails aren’t being sent. http://phab.communitysysadmins.org/T61

/cc @mrz, @wdowling. Can one of you please look into this?

I’ve created a case with HP to find out what the issue is. Quite odd. Sounds like a residual effect from the maintenance.

Just as an update, this is an issue with the node our network is on in HP Cloud. Nothing to do with our configuration. HP is working on a fix as we speak. We really need to work on redundancy, though, to avoid situations like this.