Monitoring of servers

SamuelMoraesF · April 17, 2015, 10:12pm

Hi,

We need to select which monitoring service we’ll start using. There are a lot of these services, like:

Uptime Robot
Pingdom
New Relic
…

I think that the best solution is Pingdom with the free plan of New Relic.

BTW, how many services will we need to monitor?

mrz · April 17, 2015, 10:41pm

We should first understand our requirements and then search for a solution that matches those.

Monitoring to me really encompasses the following:

binary checks (up/down, exceed threshold)
trending & threshold alerts

When something breaks I want to be armed with as much data as possible to troubleshoot. Think Nagios + Ganglia + Graphite.

Cloudwatch gets most of this and in a world where I can’t have Datadog, we should leverage Cloudwatch as much as possible. We should have alerts from Cloudwatch into VictorOps.

We have Pingdom for a year (I paid for it) and that’s an excellent tool for offsite website monitoring. It’s plugged into health.mozilla-community.org (which we also have for a year).

New Relic is good but its core competency is around application instrumentation and less so around system health/metrics.

I’d argue that even in a cloud world where you can autoscale, you still want metrics around performance. I don’t, however, think I need to get paged when CPU is “high”. I want to know about it, and more so about its usage over time.

(Really, I want Datadog…)

SamuelMoraesF · April 20, 2015, 5:36pm

I created a test account on Datadog. It’s a good solution, and I’m OK with starting to use it.

majken · April 20, 2015, 5:42pm

In the related discussion on IRC, cost was definitely an issue as well. It’s one thing when it’s just one community’s sites, but we have a lot!

mrz · April 20, 2015, 5:49pm

Agreed, sadly. Price is what’s currently preventing this otherwise totally awesome solution!

SamuelMoraesF · April 20, 2015, 5:55pm

Sadly. How many servers will we need to monitor?

comzeradd · April 20, 2015, 6:41pm

We should also think about what’s best in the long term. Pingdom is nice but it costs money, so we need to decide if it’s good enough to pay for after our current subscription ends at the end of this year.

I personally use Uptime Robot, which is practically free, instead of Pingdom. We can combine this with New Relic, which is what most of Mozilla is currently using.

tad · April 21, 2015, 8:07pm

Something we’re missing is basic server monitoring:
Disk Usage
CPU
RAM

That kind of stuff.
Things like that can be used to predict and prevent a fault before it becomes an incident.
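
For illustration, here is a minimal sketch of what collecting those three numbers could look like using only the Python standard library. It is not taken from any tool discussed in this thread. The function names (`disk_used_percent`, `ram_used_percent`, `cpu_load_per_core`, `collect_metrics`) are made up for this example, and the RAM and CPU parts assume a Linux host with `/proc/meminfo` and a kernel recent enough to report `MemAvailable`.

```python
import os
import shutil


def disk_used_percent(path="/"):
    """Percentage of disk space used on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def ram_used_percent(meminfo_text):
    """RAM usage computed from the text of Linux's /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]  # assumption: kernel exposes this field
    return 100.0 * (total - available) / total


def cpu_load_per_core():
    """1-minute load average divided by the core count (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return load_1min / (os.cpu_count() or 1)


def collect_metrics(path="/"):
    """Gather one sample of the basic metrics as a plain dict (Linux only)."""
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    return {
        "disk_used_percent": disk_used_percent(path),
        "ram_used_percent": ram_used_percent(meminfo_text),
        "cpu_load_per_core": cpu_load_per_core(),
    }
```

An agent could call `collect_metrics()` on a schedule and ship the resulting dict to whatever backend gets chosen. Storing every sample, rather than only comparing it to a limit, is what makes the "predict and prevent" part possible later.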

SamuelMoraesF · April 21, 2015, 11:47pm

First of all, we need to specify what will be monitored. So far:

Server resources (CPU, RAM, Disk Usage)
Uptime

Don’t we have any special services to monitor? Docker?

How about log monitoring? (We have a lot of services/servers, and reading logs server by server is tedious.) Don’t we have a syslog server?

BTW, which monitoring services are we running already? Just Pingdom and Uptime Robot?

We’ll document all this info in the pad: https://communityit.etherpad.mozilla.org/monitoring

mrz · April 22, 2015, 3:24am

@SamuelMoraesF - at this point, I think you should define our monitoring standards. Propose something for the group to review.

My requirements/wishes:

I’m currently a fan of agent-based monitoring systems where agents check in (vs. configuring the monitoring system)
Should integrate with AWS Cloudwatch
Handle trends over time with threshold alerting (high CPU is not an issue unless it’s abnormal because a graph says so! Disk space > 80% isn’t an issue unless the rate of consumption has changed!)
Needs to integrate with VictorOps
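
To make the trend point concrete, here is a rough Python sketch of an alert that fires only when disk usage is above a level and the rate of consumption has changed. It is an illustration under assumptions, not how CloudWatch, Datadog or any other product mentioned here does it. The names `slope` and `disk_alert`, and the parameters `recent`, `level` and `factor`, are hypothetical.

```python
def slope(samples):
    """Least-squares slope of (time, value) samples, in value units per time unit."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def disk_alert(samples, recent=6, level=80.0, factor=2.0):
    """Return True only if usage is above `level` AND the growth rate over the
    last `recent` samples is more than `factor` times the earlier (baseline) rate.

    `samples` is a time-ordered list of (hours, percent_used) pairs.
    """
    if len(samples) < recent + 2:
        return False  # not enough history to compare two trends
    baseline = slope(samples[:-recent])
    current = slope(samples[-recent:])
    latest = samples[-1][1]
    # A negative baseline (disk being freed) is treated as zero growth.
    return latest > level and current > 0 and current > factor * max(baseline, 0.0)


# Steady growth of 0.5% per hour that crosses 80%: no alert, the trend is normal.
steady = [(t, 70 + 0.5 * t) for t in range(24)]

# Same start, but the last 6 hours grow at 2% per hour: alert.
accelerating = [(t, 70 + 0.5 * t) for t in range(18)]
accelerating += [(t, 78.5 + 2.0 * (t - 17)) for t in range(18, 24)]

print(disk_alert(steady))        # False
print(disk_alert(accelerating))  # True
```

The window size and the factor are arbitrary here; a real setup would tune them per metric, and would likely want something more robust than a straight line fit for noisy data.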

But whatever you do, stay away from Zenoss. At Day Job I am looking at Zabbix.

These days, centralized logging is typically Logstash (or rather ELK: Elasticsearch, Logstash & Kibana), but that might be overkill. Awesome experience though.

SamuelMoraesF · April 22, 2015, 10:47pm

Integration with VictorOps
Integration with AWS Cloudwatch
Integration with Docker
Uptime checks
“Smart” alerts
Basic monitoring (Disk usage, CPU, RAM…)

More requirements/wishes?

I just want to make sure that this is all we need in a monitoring solution.

comzeradd · April 23, 2015, 6:15pm

I’m probably missing some context here, but why does it have to integrate with VictorOps?

mrz · April 23, 2015, 6:18pm

We are currently using VO for alerts & paging.

SamuelMoraesF · April 24, 2015, 6:29pm

I think that we can go to the next step and search for monitoring services that satisfy these requirements.

Recommendations?

majken · April 24, 2015, 8:55pm

I don’t see answers to the 3rd question on the pad yet though. I would make sure all 3 questions are fully answered before moving on.

SamuelMoraesF · April 29, 2015, 12:05am

Done, I updated the pad.

axil42 · April 30, 2015, 2:38pm

If you go with a paid solution, make sure to check https://sysdigcloud.com. There is a 7-day trial.

It integrates with AWS, works with agents, has cpu/ram metrics, etc. I don’t know about VO, though.
