Hello, and welcome back to my “Monitoring vSphere 5.1 with Nagios/Icinga” series. In part 1 we covered some basics and described the interfaces available for monitoring a vSphere environment. In this follow-up, I am going to take a closer look at the first two interfaces mentioned there.
But first: With a choice of 5 interfaces, which is the best one to use?
The answer to this question heavily depends on your answer to another question: “What do you want to monitor?” The options you have include ICMP, SSH, vSphere API, SNMP and CIM. Most of the time, you will want to use a combination of them to monitor your environment.
To integrate SSH-based checks into Nagios or Icinga, you need to enable public key authentication for the root user and copy a public key to the system. The current way to do this in ESXi 5.1 is to append the public key to /etc/ssh/keys-root/authorized_keys:
$ cat ~/.ssh/id_rsa.pub | ssh email@example.com 'cat >> /etc/ssh/keys-root/authorized_keys'
To use this from within Nagios, take the Nagios system’s public key and install it on your ESXi hosts as described above. Now you can run commands like
$ ssh firstname.lastname@example.org esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:00:19.00 e1000e Up 1000Mbps Full 00:19:99:79:d4:38 1500 Intel Corporation 82578DM Gigabit Network Connection
vmnic1 0000:11:05.00 e1000 Up 1000Mbps Full 00:1b:21:51:ca:89 1500 Intel Corporation 82541PI Gigabit Ethernet Controller
vmnic2 0000:11:07.00 e1000 Down 0Mbps Half 00:1b:21:79:99:ea 1500 Intel Corporation 82541PI Gigabit Ethernet Controller
and then take the output and parse it for the information you need. I could find one or two examples of Nagios/Icinga plugins that use this method, but none of them provided a broad spectrum of checks. So it seems this method of checking ESXi hosts is not very common, which makes a lot of sense, since the commands and their output formats have not remained very consistent over the last few major releases. Whenever something changes in the output, the plugin has to be rewritten to support the new format.
Still, vCenter appliances could be checked via SSH or, better, via the traditional NRPE approach.
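To make the parsing step more concrete, here is a minimal sketch in Python of what such a plugin might do with the esxcfg-nics output shown above. The function names (parse_nics, check_nics) and the decision to report CRITICAL for any NIC whose link is not "Up" are my own assumptions for illustration, not taken from an existing plugin. The sample text is hard-coded; in a real check it would come from the ssh call.

```python
# Sample output copied from the "esxcfg-nics -l" call above.
SAMPLE = """\
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:00:19.00 e1000e Up 1000Mbps Full 00:19:99:79:d4:38 1500 Intel Corporation 82578DM Gigabit Network Connection
vmnic1 0000:11:05.00 e1000 Up 1000Mbps Full 00:1b:21:51:ca:89 1500 Intel Corporation 82541PI Gigabit Ethernet Controller
vmnic2 0000:11:07.00 e1000 Down 0Mbps Half 00:1b:21:79:99:ea 1500 Intel Corporation 82541PI Gigabit Ethernet Controller
"""

# Column names are assigned by position; the header line itself is skipped
# because "MAC Address" contains a space and cannot be split reliably.
FIELDS = ["name", "pci", "driver", "link", "speed",
          "duplex", "mac", "mtu", "description"]


def parse_nics(text):
    """Turn the command output into a list of dicts, one per NIC."""
    nics = []
    for line in text.splitlines()[1:]:
        # Split on whitespace, but keep the free-text description intact.
        parts = line.split(None, len(FIELDS) - 1)
        if len(parts) == len(FIELDS):
            nics.append(dict(zip(FIELDS, parts)))
    return nics


def check_nics(text):
    """Return a (Nagios exit code, status message) tuple."""
    nics = parse_nics(text)
    if not nics:
        return 3, "UNKNOWN - no NICs found in output"
    down = [nic["name"] for nic in nics if nic["link"] != "Up"]
    if down:
        return 2, "CRITICAL - link down on: " + ", ".join(down)
    return 0, "OK - all %d NICs are up" % len(nics)


code, message = check_nics(SAMPLE)
print(message)
```

A real plugin would finish by exiting with the returned code. Note how tightly the sketch is coupled to the column order: this is exactly the fragility described above when the output format changes between releases.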
The vSphere API can be accessed using Perl, which is my preferred language on Linux for monitoring and scripting anyway. There seem to be SDKs for Java and Python, too, but according to my very brief research they are not officially supported and might not be compatible with 5.1 (I might well be wrong, so please double-check). There are a few ‘production-ready’ check scripts out there which can easily be integrated into your Nagios/Icinga environment. The most advanced of them (this reflects my personal opinion only, of course) comes from op5.org and can be downloaded here. The reason why I consider it “the most advanced” is its ease of use and the large number of checks it can perform. I ran
for you and uploaded the file; I tried to copy and paste the text here first, but the list is just too long. Here is an example of how it is used:
$ ./check_vmware_api.pl -D 10.173.33.4 -u root -p **** -N vcenter -l cpu -s ready
CHECK_VMWARE_API.PL OK - "vcenter" cpu ready=11.00 ms | cpu_ready=11.00ms;;
As you can see above, the script is pretty easy to use and in addition to the check results it outputs performance data that can be parsed and stored if applicable.
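For readers who have not looked at performance data before: everything after the "|" in the plugin output is a list of label=value[unit];warn;crit tokens. The following Python sketch shows roughly how that part could be picked apart. The function name parse_plugin_output is hypothetical, and the parser is deliberately simplified (it ignores quoted labels containing spaces and the optional min/max fields).

```python
import re

# A number followed by an optional unit such as "ms", "%" or "Mhz".
VALUE_RE = re.compile(r"^(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^\d;]*)$")


def parse_plugin_output(line):
    """Split one line of plugin output into status text and metrics."""
    text, _, perf = line.partition("|")
    metrics = {}
    for token in perf.split():
        label, _, rest = token.partition("=")
        fields = rest.split(";")
        match = VALUE_RE.match(fields[0])
        if not match:
            continue  # skip anything that does not look like a metric
        metrics[label] = {
            "value": float(match.group("value")),
            "unit": match.group("unit"),
            # Thresholds are kept as strings; empty means "not set".
            "warn": fields[1] if len(fields) > 1 and fields[1] else None,
            "crit": fields[2] if len(fields) > 2 and fields[2] else None,
        }
    return text.strip(), metrics


# Output line taken from the example call above.
LINE = ('CHECK_VMWARE_API.PL OK - "vcenter" cpu ready=11.00 ms'
        ' | cpu_ready=11.00ms;;')

text, metrics = parse_plugin_output(LINE)
print(text)
print(metrics["cpu_ready"]["value"], metrics["cpu_ready"]["unit"])
```

In practice you rarely need to write this yourself, since Nagios/Icinga and the usual graphing add-ons already understand the format; the sketch is only meant to show what the data looks like once it is parsed.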
The general problem with using the vSphere API for monitoring is performance. Connecting, authenticating and querying can be quite time-consuming. Just to give you an idea: the command as executed above takes around 4 seconds to complete, and it covers only a single metric of a single VM! There are a few ways to address this:
Possibility 1. Configure a longer check interval in Nagios/Icinga. This means Nagios/Icinga will not execute the check as frequently, but it will also take load off your monitoring system and improve general check performance.
Possibility 2. Don’t use the -s <subcommand> switch of the check_vmware_api.pl command. As you can see in the help text, it is marked as optional (indicated by square brackets). If you leave it out, the script will perform every “sub-check” of the given command (-l). This way, the overhead of connecting and authenticating is incurred only once. So instead of checking CPU Ready, CPU Usage %, CPU Usage MHz and CPU Wait separately, execute this instead:
$ ./check_vmware_api.pl -D 10.173.33.4 -u root -p **** -N vcenter -l cpu
CHECK_VMWARE_API.PL OK - "vcenter" cpu usage=55.00 MHz(0.87%), wait=39210.00 ms, ready=15.00 ms | cpu_usagemhz=55.00Mhz;; cpu_usage=0.87%;; cpu_wait=39210.00ms;; cpu_ready=15.00ms;;
Possibility 3. The script supports a parameter called --timeout. It lets you set a timeout value in seconds after which the script will return, whether successful or not. You might want to set this value lower, which results in more timeouts but has the advantage of a more responsive Nagios/Icinga system.
Possibility 4. Use Nagios/Icinga methods for scaling, for example DNX (Distributed Nagios eXecutor), to distribute checks across multiple systems, which takes load off each of them and runs the checks in parallel.
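To illustrate the trade-off behind possibilities 2 and 3, here is a small Python sketch of a wrapper that runs a single check command under a hard time limit and reports UNKNOWN if the limit is exceeded. The wrapper and the name run_check are hypothetical and not part of check_vmware_api.pl; the script's own --timeout parameter is the more direct route. The demo uses the Python interpreter as a stand-in for the plugin so that it runs without a vSphere environment.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
UNKNOWN = 3


def run_check(argv, timeout_s):
    """Run one check command; return (exit code, first line of output)."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, "UNKNOWN - check timed out after %.1f s" % timeout_s
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - cannot run check: %s" % exc

    output = proc.stdout.strip()
    first_line = output.splitlines()[0] if output else ""
    # Anything outside 0..3 (for example a crash) is mapped to UNKNOWN.
    code = proc.returncode if proc.returncode in (0, 1, 2, 3) else UNKNOWN
    return code, first_line


# Stand-in for the real call, which would look like this:
# ["./check_vmware_api.pl", "-D", "10.173.33.4", "-u", "root",
#  "-p", "****", "-N", "vcenter", "-l", "cpu"]
demo = [sys.executable, "-c", "print('OK - stand-in check | cpu_ready=15.00ms;;')"]
print(run_check(demo, timeout_s=10))
```

Running the check once with -l cpu and a single time limit means a slow vCenter costs you one delayed result instead of four, at the price of losing all four metrics when the limit is hit.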