We should first understand our requirements and then search for a solution that matches those.
Monitoring to me really encompasses the following:
binary checks (up/down, exceed threshold)
trending & threshold alerts
When something breaks I want to be armed with as much data as possible to troubleshoot. Think Nagios + Ganglia + Graphite.
CloudWatch covers most of this, and in a world where I can’t have Datadog, we should leverage CloudWatch as much as possible. We should route alerts from CloudWatch into VictorOps.
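To make the "binary check" idea concrete, here is a rough sketch of an up/down alarm defined in CloudWatch with boto3. It assumes alerts reach VictorOps through an SNS topic that VictorOps is subscribed to; the topic ARN, instance ID, and function names are placeholders I made up, not anything we have set up today.

```python
def status_check_alarm(instance_id, sns_topic_arn):
    """Build put_metric_alarm parameters for a simple up/down check.

    The alarm fires when the EC2 StatusCheckFailed metric is >= 1
    for two consecutive one-minute periods.
    """
    return {
        "AlarmName": f"{instance_id}-status-check-failed",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Assumption: this SNS topic forwards to VictorOps.
        "AlarmActions": [sns_topic_arn],
        "OKActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder values for illustration only.
params = status_check_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:victorops-alerts",
)
```

Calling `create_alarm(params)` would register the alarm. Sending the OK action to the same topic is a choice on my part, so that the incident can resolve on its own when the check recovers.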
We have Pingdom for a year (I paid for it) and that’s an excellent tool for offsite website monitoring. It’s plugged into health.mozilla-community.org (which we also have for a year).
New Relic is good, but its core competency is application instrumentation, less so system health/metrics.
I’d argue that even in a cloud world where you can autoscale, you still want metrics around performance. I don’t, however, think I need to get paged when CPU is “high”. I want to know about it, and more so about its usage over time.
We should also think about what’s best in the long term. Pingdom is nice but it costs money, so we need to decide whether it’s good enough to keep paying for after our current subscription ends at the end of this year.
I personally use UptimeRobot instead of Pingdom, and it is practically free. We can combine this with New Relic, which is what most of Mozilla is currently using.
@SamuelMoraesF - at this point, I think you should define our monitoring standards. Propose something for the group to review.
My requirements/wishes:
I’m currently a fan of agent-based monitoring systems where agents check in (vs. configuring the monitoring system)
Should integrate with AWS CloudWatch
Handle trends over time with threshold alerting (high CPU is not an issue unless it’s abnormal because a graph says so! Disk space > 80% isn’t an issue unless the rate of consumption has changed!)
Needs to integrate with VictorOps
But whatever you do, stay away from Zenoss. At my day job I am looking at Zabbix.
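To illustrate the "rate of consumption has changed" requirement above, here is a minimal sketch of a trend check on disk usage. The function names, the sample data, and the thresholds (`recent`, `factor`, `min_rate`) are all made up for illustration; a real monitoring system would do this with its own trend or anomaly functions.

```python
from statistics import mean


def consumption_rates(samples):
    """Return growth rates in percent per hour between consecutive samples.

    samples: list of (timestamp_seconds, used_percent), ordered by time.
    """
    rates = []
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        hours = (t1 - t0) / 3600.0
        rates.append((u1 - u0) / hours)
    return rates


def rate_changed(samples, recent=3, factor=2.0, min_rate=0.5):
    """Alert only when recent growth is well above the earlier baseline.

    recent:   number of latest intervals treated as "now"
    factor:   how many times faster than baseline counts as abnormal
    min_rate: ignore growth slower than this (percent per hour)
    """
    rates = consumption_rates(samples)
    if len(rates) <= recent:
        return False  # not enough history to establish a baseline
    baseline = mean(rates[:-recent])
    current = mean(rates[-recent:])
    return current > min_rate and current > factor * max(baseline, 0.0)


# Hourly samples: disk grows a steady 1% per hour.
steady = [(h * 3600, 70.0 + h) for h in range(10)]

# Same start, but the last three hours grow 5% per hour.
spike = steady[:7] + [(7 * 3600, 81.0), (8 * 3600, 86.0), (9 * 3600, 91.0)]

print(rate_changed(steady))  # steady growth, no alert
print(rate_changed(spike))   # growth rate jumped, alert
```

Note that the `steady` series ends at 79% and keeps climbing at the same pace without alerting, while `spike` alerts because the pace changed, which is the behavior I am asking for.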
These days, centralized logging is typically Logstash (or rather ELK: Elasticsearch, Logstash & Kibana), but that might be overkill. Awesome experience though.