Monitoring of servers


(Removed Account) #1

Hi,

We need to choose which monitoring service we’ll start using. There are a lot of these services, such as:

  • Uptime Robot
  • Pingdom
  • New Relic

I think the best solution is Pingdom combined with the free plan of New Relic.

BTW, how many services will we need to monitor?


(mrz) #2

We should first understand our requirements and then search for a solution that matches those.

Monitoring to me really encompasses the following:

  • binary checks (up/down, exceed threshold)
  • trending & threshold alerts

When something breaks I want to be armed with as much data as possible to troubleshoot. Think Nagios + Ganglia + Graphite.

Cloudwatch gets most of this and in a world where I can’t have Datadog, we should leverage Cloudwatch as much as possible. We should have alerts from Cloudwatch into VictorOps.

We have Pingdom for a year (I paid for it) and that’s an excellent tool for offsite website monitoring. It’s plugged into health.mozilla-community.org (which we also have for a year).

New Relic is good, but its core competency is around application instrumentation and less so around system health/metrics.

I’d argue that even in a cloud world where you can autoscale, you still want metrics around performance. I don’t, however, think I need to get paged when CPU is “high”. I want to know about it, and more so about its usage over time.

(Really, I want Datadog…)


(Removed Account) #3

I created a test account on Datadog. It’s a good solution, and I’m OK with starting to use it.


#4

In the related discussion on IRC, cost was definitely an issue as well. It’s one thing when it’s just one community’s sites, but we have a lot!


(mrz) #5

Agreed, sadly. Price is what’s currently preventing this otherwise totally awesome solution!


(Removed Account) #6

Sadly. How many servers will we need to monitor?


(Nikos Roussos) #7

We should also think about what’s best in the long term. Pingdom is nice, but it costs money. So we need to decide if it’s good enough to pay for after our current subscription ends at the end of this year.

I personally use Uptime Robot, which is practically free, instead of Pingdom. We can combine this with New Relic, which is what most of Mozilla is currently using.


(Tom Farrow) #8

Something we’re missing is basic server monitoring:

  • Disk usage
  • CPU
  • RAM

That kind of stuff. Things like that can be used to predict and prevent a fault before an incident.
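
For illustration only, here is a minimal sketch of how those three basic metrics could be sampled on a Linux host using just the Python standard library. The function names (`disk_used_fraction`, `load_per_cpu`, `parse_meminfo`) are made up for this example, and the RAM figure assumes the Linux `/proc/meminfo` format with a `MemAvailable` field; a real agent from any of the tools discussed here would do much more.

```python
import os
import shutil


def disk_used_fraction(path="/"):
    """Fraction (0..1) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def load_per_cpu():
    """1-minute load average divided by CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def parse_meminfo(text):
    """Fraction (0..1) of RAM in use, from /proc/meminfo-style text.

    Assumes the Linux format "Name:   <integer> kB" and that both
    MemTotal and MemAvailable are present.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return 1 - fields["MemAvailable"] / fields["MemTotal"]


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        ram = parse_meminfo(handle.read())
    print(f"disk={disk_used_fraction():.0%} "
          f"load/cpu={load_per_cpu():.2f} ram={ram:.0%}")
```

Sampling values like these on a schedule and storing them over time is what makes trend-based prediction possible later on.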


(Removed Account) #9

First of all, we need to specify what will be monitored. So far:

  • Server resources (CPU, RAM, Disk Usage)
  • Uptime

Don’t we have any special services to monitor? Docker?

How about log monitoring? (We have a lot of services/servers, and reading logs server by server is tedious.) Don’t we have a syslog server?

BTW, which monitoring services are we running already? Just Pingdom and Uptime Robot?

We’ll document all this info in the pad: https://communityit.etherpad.mozilla.org/monitoring :smile:


(mrz) #10

@SamuelMoraesF - at this point, I think you should define our monitoring standards. Propose something for the group to review.

My requirements/wishes:

  • I’m currently a fan of agent-based monitoring systems where agents check in (vs. configuring the monitoring system)
  • Should integrate with AWS Cloudwatch
  • Handle trends over time with threshold alerting (high CPU is not an issue unless it’s abnormal because a graph says so! Disk space > 80% isn’t an issue unless the rate of consumption has changed!)
  • Needs to integrate with VictorOps

But whatever you do, stay away from Zenoss. At Day Job I am looking at Zabbix.

In today’s modern world, centralized logging is typically Logstash (or rather ELK, Elasticsearch, Logstash & Kibana) but that might be overkill. Awesome experience though.
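
To make the trend-based wish in the list above concrete (disk space > 80% only matters if the rate of consumption has changed), here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the `(hour, used_fraction)` sample format, and the default thresholds (`usage_floor`, `rate_factor`, `min_rate`) are invented here and are not how Cloudwatch, Datadog, or any other tool in this thread actually works.

```python
from statistics import mean


def growth_rate(samples):
    """Least-squares slope of usage over time (usage units per hour).

    `samples` is a list of (hour, used_fraction) pairs.
    """
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    t_mean, v_mean = mean(times), mean(values)
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        return 0.0
    return sum((t - t_mean) * (v - v_mean) for t, v in samples) / denom


def should_alert(samples, recent_count=6, usage_floor=0.80,
                 rate_factor=2.0, min_rate=0.001):
    """Alert only when usage is above the floor AND growth has sped up.

    The last `recent_count` samples are compared against all earlier
    samples; the recent slope must exceed `rate_factor` times the
    baseline slope (with `min_rate` as a lower bound on the baseline
    so that a flat history does not trigger on tiny changes).
    """
    if len(samples) < 2 * recent_count:
        return False  # not enough history to judge a trend
    baseline, recent = samples[:-recent_count], samples[-recent_count:]
    if samples[-1][1] <= usage_floor:
        return False
    baseline_rate = max(growth_rate(baseline), min_rate)
    return growth_rate(recent) > rate_factor * baseline_rate


# Steady growth that crosses 80%: no alert, the rate has not changed.
steady = [(h, 0.70 + 0.005 * h) for h in range(24)]

# Same start, but consumption speeds up in the last six hours.
spike = [(h, 0.70 + 0.005 * h) for h in range(18)]
spike += [(h, 0.785 + 0.02 * (h - 17)) for h in range(18, 24)]

print(should_alert(steady), should_alert(spike))
```

The design choice here is to treat the 80% mark as a gate rather than a trigger: crossing it alone pages nobody, and the alert fires only when the recent slope is clearly steeper than the historical one.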


(Removed Account) #11

  • Integration with VictorOps
  • Integration with AWS Cloudwatch
  • Integration with Docker
  • Uptime checks
  • “Smart” alerts
  • Basic monitoring (Disk usage, CPU, RAM…)

More requirements/wishes?

I just want to make sure that this is all we need in a monitoring solution.


(Nikos Roussos) #12

I’m probably missing some context here, but why does it have to integrate with VictorOps?


(mrz) #13

We are currently using VO for alerts & paging.


(Removed Account) #14

I think we can move on to the next step and search for monitoring services that satisfy these requirements.

Recommendations?


#15

I don’t see answers to the 3rd question on the pad yet though. I would make sure all 3 questions are fully answered before moving on.


(Removed Account) #16

Done, I updated the pad :slight_smile:


(Achilleas Pipinellis) #17

If you go with a paid solution, make sure to check out https://sysdigcloud.com. There is a 7-day trial.

It integrates with AWS, works with agents, has cpu/ram metrics, etc. I don’t know about VO, though.