Avoiding load average spike alerts on Munin monitoring

Munin is a server monitoring tool written in Perl. In this post I’ll introduce some monitoring basics and how to avoid unnecessary monitoring alerts on temporary server conditions.

Munin has vibrant plugin community. You can easily write your own plugins, even using shell script. Plugins are autodiscovered, so dropping a script file on the server is enough to create your own monitoring graph. Munin master data collection is driven by cron and by default outputs static HTML files.  It is very easy to setup and secure, being immune to web attacks e.g. what most legacy PHP systems suffer.

Munin has its downsides, too. Like with most open source projects, Munin documentation is rough and not very helpful. The configuration file has its own format and discovering potential variables, values and opportunities is cumbersome. Munin scaling might not be ideal for large operations, both computer-wise and management-wise, but it’s ok up to 10-20 servers. Also Munin alert mechanism is quite naive, and you cannot have alert trigger math functions like e.g. with Zabbix.

This brings to the problem: because Munin lacks any kind of math it can perform on monitoring data, it alerts immediately when some value is out of bounds. This may result to unnecessary alerts on temporary conditions which really don’t need system administrator action.

1. Monitoring your server CPU resources

Screen Shot 2014-09-24 at 14.37.00

Server CPU resource sufficiency is best monitored with load average. When the load average gets too high, your server tasks are delayed and your response times start to suffer. However there might be various causes for temporary load average spikes, which resolve itself and should not be subject for alerting your devops team.

  • Brute force attacks before security starts mitigating them (short period before IPs are blocked)
  • Some periodical task takes more resources than usually
  • Network connection issues or such cause temporary thundering herd problem

Most of these issues are resolved automatically, within seconds or minutes when they begin. However, because Munin naively monitors only the latest load value, it cannot have logic to determine whether the load alert is genuine or temporary. In Zabbix this problem can be avoided by monitoring the minimum load value over a time period.

2. Smoothing monitored value over a time period

Because Munin does not support trigger functions, one must calculate the load average smoothing on the server-side. Below is a sample Python script which will monitoring the minimum load average over four Munin report cycles (20 minutes). Thus, it will alert you only if the load has stayed too high over 20 minutes.

There exists a Python framework to build Munin plugins. But because the use case is so simple, the script can be self-contained.

#!/usr/bin/python
#
# Munin plugin for getting the minimum sytem load avg. of 20 minutes.
#
# We do not want to alert devops if the high load situation
# resolves itself in few minutes.
#
# Compared to load avg (what is usually monitored by default),
# load minimum smoothes out load spikes caused by
#
# - Brute force attacks before security starts mitigating them
#
# - Timed jobs
#
# - Lost of network connection
#
# All these are visible in max load of time period, avg. load of time period,
# but do not affect min load (the base load level).
#
# Because Munin does not support trigger filtering functions, like e.g. Zabbix,
# we do the minimum load filtering on the server side by taking samples and
# then calculating out the minimum.
#
# Installation:
#
#   Put or symlink this script to /etc/munin/plugins/minload
#   chmod a+rx /etc/munin/plugins/minload
#   service munin-node restart
#
# Testing:
#
#   sudo -u munin munin-run minload
#
#

from __future__ import print_function

import os
import json
import io

__author__ = "Mikko Ohtamaa <mikko@opensourcehacker.com>"
__license__ = "MIT"

# This is the file which tracks the load data
PERSISTENT_STORAGE = "/tmp/munin-minload.json"

# Read or construct data persitent data
try:
    data = json.load(io.open(PERSISTENT_STORAGE, "rb"))
except:
    data = []

# munin-cron calls the server every 5 minutes
# well keep the last 4 entries, take min,
# so we get minimum for 20 minutes

data.append(os.getloadavg()[0])
data = data[-4:]
min_load = min(data)

with io.open(PERSISTENT_STORAGE, "wb") as f:
    json.dump(data, f)

print("""graph_title Load min 20 minutes
graph_vlabel load
graph_category system
rate.label rate
rate.value {}
rate.warning 20
rate.critical 40
graph_info The minimum load of the system for last 20 minutes""".format(min_load))

 

 

 

\"\" Subscribe to RSS feed Follow me on Twitter Follow me on Facebook Follow me Google+