Monitoring proposal

Yousef and I worked on a monitoring MVP, and we’re looking for some feedback.

For an MVP monitoring setup, we need to be able to:

  • See the status of all community ops-controlled sites
  • Get alerted via VictorOps when a site goes down
  • Ensure the monitoring solution is scalable and reliable
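
To make the first two requirements concrete, here is a minimal sketch of a status check in Python. It is not a proposed implementation. The function names (http_status, check_sites, alert_on_down) and the send_alert callback are hypothetical, and the actual VictorOps call is deliberately left as a pluggable callback rather than guessed at.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_sites(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each URL to True (up) or False (down).

    Assumption: any 2xx or 3xx status counts as "up".
    """
    return {url: 200 <= fetch(url) < 400 for url in urls}


def alert_on_down(
    statuses: Dict[str, bool],
    send_alert: Callable[[str], None],
) -> List[str]:
    """Call send_alert once per down site and return the down URLs.

    send_alert is a hypothetical hook; in practice it would wrap
    whatever alerting integration (e.g. VictorOps) we settle on.
    """
    down = [url for url, up in statuses.items() if not up]
    for url in down:
        send_alert(f"{url} is down")
    return down
```

Passing the fetcher and the alert sender in as arguments keeps the check itself testable without network access, and lets us swap the alerting backend later without touching the check logic.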

Other things we’d like monitoring to do in the future:

  • Monitor community sites if desired by owners
  • Store logs from various apps and be able to easily search them (ELK)
  • Monitor hardware on legacy VPSes (storage, memory, etc.)
  • Monitor mofo sites
  • Automatically respond to events (like CloudWatch)

I suggest changing this from a “monitoring proposal” to an “Operational Intelligence” proposal.

The latter includes things you mentioned, like log aggregation (ELK) and trending (think OpenTSDB/tcollector + Bosun).

In a later iteration, it would be great to include playbooks/recipes/manifests that others can use to plug in to this system.

