Tuesday, May 26, 2015

Using M/Monit Safely with AWS

In my main role, I am CTO of BitMEX.com, an up-and-coming Bitcoin derivatives exchange. We run on AWS, and we have a number of isolated hosts and a very restrictive and partitioned network.

At BitMEX, we are wary of using any monitoring platforms that could cause us to lose control. This means staying away from as much closed-source software as possible, and only using tried-and-true tools.

One of our favorite system monitoring tools is the aptly-named Monit. Monit is great at keeping on top of your processes in more advanced ways than simple pid-checking; it can check file permissions, send requests to the monitored process on a port (even over SSL), and check responses. It can do full-system monitoring (cpu, disk, memory) and has an easy-to-use config format for all of it.

Sounds good, right? Well, as nice as it is, we need to get notifications when a system fails - and fast. We configured all of our servers to send mail to a reserved email address that would follow certain rules, post a support ticket to our usual system, and ping Slack. But emails on EC2 are notoriously slow because EC2 could so easily be used for spam.

We found our postfix queue getting 30 minutes deep or more because of AWS's throttling. We use external monitoring tools so it's not always a problem; we generally find out about an outage within about 30 seconds. But if there is a more subtle problem (like a server running out of inodes or RAM) but it takes 30 minutes to get the alert, there's a problem.

I started reconfiguring our monit instances to simply curl a webhook so we'd get Slack notifications right away. That's good, but any change in the hook and I'd need to log into every single server and reconfigure. Besides, I could really use a full-system dashboard. What to do?


M/Monit solves this problem. It provides a simple (yeah, not the prettiest, but functional) dashboard for all hosts, detailed analytics, and configurable alerts. Now, I can just configure all of my hosts to alert on various criteria, and configure M/Monit's Alert Rules to take care of figuring out how severe it is, who should be notified, and how.

M/Monit supports actually starting, stopping, and restarting services via the dashboard, but that requires M/Monit to have the capability to connect to each server's Monit (httpd) server. I didn't want this; setting it up correctly is a pain (have fun generating separate credentials per-server for SSL) and it's a potential DoS vector. Thankfully, M/Monit runs just fine if you don't open the port.

Setting up M/Monit is pretty easy and I don't think I need to go through it; I recommend using SSL though, even inside your private network.

Here's how I set it up on each server (assuming existing monit configs):

set mmonit https://<user>:<pass>@<host>:<port>/collector 
  and register without credentials

That's it. Just reload the monit service and it'll start uploading data.

What's the credentials bit? Well, Monit supports automatically creating credentials at first start that M/Monit can use for controlling processes. But I didn't want to support that anyway, so I added this line to disable it. It's not a big deal if you omit it, and the control port is left closed; M/Monit just won't be able to connect, but it will continue monitoring properly.

That's it! I then used the script above as an alert mechanism inside Monit (Admin -> Alerts) and it happily started sending instant notifications to Slack. You can configure just about any type of webhook, mail notification, whatever.

It took surprisingly little time to get this right, and the team behind Monit deserves a lot of credit for this.