munin

Limiting disk iops on a larger Munin server using rrdcached

I've long used Munin for basic resource monitoring on a huge variety of servers. It's simple, reliable, easy to configure, and besides the fact that it uses Perl for plugins, there's not much against it!

Last week, I got a notice from my 'low end box' VPS provider that my Munin server—which is aggregating data from about 50 other servers—had high IOPS and would be shut down if I didn't get it back into an allowed threshold. Most low end VPSes run things like static HTML websites, so disk IO is very low on average. I checked my Munin instance, and sure enough, it was constantly churning through around 50 iops. For a low end server, this can cause high iowait for other tenants of the same server, so I can understand why hosting providers don't want applications on their shared servers doing a lot of constant disk I/O.

Using iotop, I could see the munin-update processes were spending a lot of time writing to disk. And munin's own diskstats_iops plugin showed the same:

Getting Munin-node to monitor Nginx and Apache, the easy way

Since this is something I think I've bumped into at least eight times in the past decade, I thought I'd document, comprehensively, how I get Munin to monitor Apache and/or Nginx using the apache_* and nginx_* Munin plugins that come with Munin itself.

Besides the obvious action of symlinking the plugins into Munin's plugins folder, you should—to avoid any surprises—forcibly configure the env.url for all Apache and Nginx servers. As an example, in your munin-node configuration (on RedHat/CentOS, in /etc/munin/plugin-conf.d, add a file named something like apache or nginx):

# For Nginx:
[nginx*]
env.url http://localhost/nginx_status

# For Apache:
[apache*]
env.url http://localhost/server-status?auto

Now, something that often trips me up—especially since I maintain a variety of servers and containers, with some running ancient forms of CentOS, while others are running more recent builds of Debian, Fedora, or Ubuntu—is that localhost doesn't always mean what you'd think it means.

Fixing Munin's [FATAL ERROR] Lock already exists: /var/run/munin/munin-update.lock. Dying.

Recently, I upgraded one of my CentOS and Ubuntu servers to a new version of Munin 2.0.x, and started getting an error stating that munin-update.lock already exists:

2013/03/25 23:11:02 Setting log level to DEBUG
2013/03/25 23:11:02 [DEBUG] Lock /var/run/munin/munin-update.lock already exists, checking process
2013/03/25 23:11:02 [DEBUG] Lock contained pid '10160'
2013/03/25 23:11:02 [DEBUG] kill -0 10160 worked - it is still alive. Locking failed.
2013/03/25 23:11:02 [FATAL ERROR] Lock already exists: /var/run/munin/munin-update.lock. Dying.
2013/03/25 23:11:02  at /usr/lib/perl5/vendor_perl/5.8.8/Munin/Master/Update.pm line 128

Munin hadn't been updating for a couple weeks, so I finally deleted the existing munin-update.lock file, and munin started running again. If this doesn't help solve your problem, have a look inside the various munin log files in /var/log/munin/ to see if one of them contains more details as to why munin isn't working for you.

Drupal 6.x and PHP 5.3.x - Date Timezone warnings

This morning, I was presented with quite the conundrum: one of my servers suddently started having about 4x the normal MySQL traffic it would have in a morning, and I had no indication as to why this was happening; traffic to the sites on the server was steady (no spikes), and I couldn't find any problems with any of the sites.

munin mysql traffic spike

However, after inspecting the Apache (httpd) error logs for the Drupal 6 sites, I found a ton of PHP warnings on almost all the sites. Something like the following: