Monitoring questions/update

## Updates

I’m working with my LOPSA mentor (darkfader on IRC) to set up monitoring. He suggested we use OMD (Open Monitoring Distribution), which is essentially a bundle of several monitoring tools, all Nagios-based. It’s running right now here; you can log in with the same csa-monitor1 password you use for the shell. If you don’t have one, talk to me and I’ll get you set up. Check_mk is pretty nice, and the other tools it comes with look useful too.

## On-call rotation

Before I say anything else: I realize this is early, but I’d like to get a discussion going so that once monitoring is fully functional, I can just add the time periods and go. Ideally we would use something like PagerDuty, but that’s not cheap, and the similar services that do have free plans don’t do what I want.

First of all, who actually wants to be on it? If you do, realize that (for a while, at least) you may be woken up at 3 AM to fix something. Emphasis on “may”, but you can never anticipate problems.

If you want to be in the rotation, what time periods are okay for you? I think we should have somebody on-call 24/7, which will be difficult with so few of us.
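To see how close a handful of volunteers gets us to 24/7, a rough sketch like the one below could help. The names and hours are made up, and treating availability as whole UTC hours is just an assumption to keep it simple.

```python
# Hypothetical availability: volunteer -> list of (start_hour, end_hour) in UTC.
# The end hour is exclusive. An end hour smaller than the start hour means the
# window wraps past midnight. A window with start == end counts as empty.
AVAILABILITY = {
    "alice": [(8, 16)],
    "bob": [(14, 23)],
    "carol": [(22, 3)],
}


def covered_hours(windows):
    """Return the set of UTC hours (0-23) covered by a list of windows."""
    hours = set()
    for start, end in windows:
        length = (end - start) % 24  # handles windows that wrap past midnight
        hours.update((start + i) % 24 for i in range(length))
    return hours


def coverage_gaps(availability):
    """Return the sorted UTC hours during which nobody is available."""
    covered = set()
    for windows in availability.values():
        covered |= covered_hours(windows)
    return sorted(set(range(24)) - covered)


print(coverage_gaps(AVAILABILITY))  # [3, 4, 5, 6, 7]
```

Any hours it prints are the gaps we would either need more volunteers for or agree to leave uncovered.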

## Notifications

I’m trying to figure out the best way to handle notifications. Right now I’m leaning toward using SNS/SES for everything, so that we don’t have to worry so much about keeping our mail servers up or rely on (apparently slow) SMS gateways. The downside to using SNS for SMS is that it only works in the US, so push notifications may be something else to look into.
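As a rough idea of what the SNS route could look like, here is a minimal sketch. It assumes a client that behaves like `boto3.client("sns")` and a topic that people subscribe to by email; the function names, the message layout, and the topic ARN are all made up for illustration.

```python
def build_alert(host, service, state, output):
    """Build a (subject, body) pair for one alert.

    SNS subjects have to stay under 100 characters, so the subject is
    truncated to 99 here.
    """
    subject = f"[{state}] {service} on {host}"[:99]
    body = (
        f"Host: {host}\n"
        f"Service: {service}\n"
        f"State: {state}\n"
        f"Output: {output}"
    )
    return subject, body


def publish_alert(sns_client, topic_arn, host, service, state, output):
    """Publish one alert to an SNS topic and return the subject that was sent.

    sns_client is assumed to expose publish(TopicArn=..., Subject=...,
    Message=...), as the boto3 SNS client does.
    """
    subject, body = build_alert(host, service, state, output)
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=body)
    return subject
```

Passing the client in keeps the sketch testable without AWS credentials; in real use it would be created once with `boto3.client("sns")` and reused.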

I don’t think we should look at SMS notifications at the moment. Email is fine for now. We don’t have the capacity to drop everything and fix an issue immediately. Since the community is made up of volunteers with school, uni, and day jobs, most of the time there will be a lag before an issue gets fixed. It will be a long while before we have enough admins to create an on-call rota. For one thing, not all admins have equal knowledge of all services, so if, say, Discourse goes down, the admin on call may not be as familiar with it as another.

To avoid this, I think we should take the opportunity now to build as much reliability into our services as possible. Think of how each service is set up. Think of the ways it could break. How can we build our services so that a failover doesn’t cause a panic?

Once we have Discourse, WPMS, Monitoring, Phab, DNS, etc. all running, I will be looking to you guys to start looking at the infrastructure from a reliability perspective. Manual work will be thrown out the window, and any monitoring alerts will be caught and acted upon automatically.
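Since the monitoring stack is Nagios-based, one possible way to get alerts acted upon automatically is a Nagios event handler, which Nagios runs on state changes and can pass macros such as `$SERVICESTATE$` and `$SERVICESTATETYPE$`. The sketch below only covers the decision logic; the service name and restart command are placeholders, not our actual setup.

```python
# Hypothetical restart commands, keyed by the Nagios service description.
RESTART_COMMANDS = {
    "Discourse": ["service", "discourse", "restart"],
}


def choose_action(service, state, state_type):
    """Return the command to run for this event, or None to do nothing.

    state      corresponds to $SERVICESTATE$ (OK, WARNING, UNKNOWN, CRITICAL)
    state_type corresponds to $SERVICESTATETYPE$ (SOFT or HARD)
    """
    # Only react to confirmed (HARD) CRITICAL problems, so a single failed
    # check that is still being retried does not trigger a restart.
    if state != "CRITICAL" or state_type != "HARD":
        return None
    # Services without a known restart command are left for a human.
    return RESTART_COMMANDS.get(service)
```

A wrapper script would read the macro values from its command-line arguments, call `choose_action`, and run the returned command (for example with `subprocess.run`). Anything that returns `None` would still go out as a normal notification.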

Contribution opportunity!

Let’s put together a bit of a blurb about monitoring (what skills you need, what skills you learn), and I can help share it and try to get new contributors. We should also try to work out a give-and-take arrangement with communities that use our monitoring: where possible, they should provide a volunteer who can take some shifts on the on-call rota.

and since I like etherpads… https://communityit.etherpad.mozilla.org/monitoring-volunteer

Be sure to link this pad from other related pads, possibly the main Icinga pad (that’s the monitoring software, right?), and we should also link it from the possible contributions pad.