Monitoring Solaris with Nagios
There are some caveats to monitoring Solaris with Nagios. The typical approach is to run the Nagios Remote Plugin Executor (NRPE) daemon on the monitored server and query it from the Nagios host, which executes the plugin remotely. Installing NRPE on NFS11 proved difficult, as the current version has issues with the SSL library shipped with Solaris 10. Instead, I chose to use the check_by_ssh plugin on the Nagios server, which logs into the remote server via SSH and runs a specific command.
Setup Nagios User
The first step in monitoring any machine remotely is to set up a nagios user:
On Remote Solaris Server:
useradd -d /home/nagios -m -s /bin/bash nagios
passwd nagios
Share SSL Keys Between Nagios and Remote Server
The next step is to switch to the nagios user on Nagios01 and SSH into the remote host:
On Nagios01:
su -l nagios
ssh <remote host name or IP>    (accept the host key when prompted)
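Note that the next step copies ~/.ssh/id_rsa.pub from Nagios01, so this assumes the nagios user there already has an RSA keypair. If it does not, generate one first; a sketch, run as the nagios user on Nagios01:

```shell
# Assumption: check_by_ssh needs passwordless, non-interactive login,
# so the key is created with an empty passphrase (-N "").
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa -q
```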
On Remote Solaris Server:
mkdir /home/nagios/.ssh
scp nagios01:/home/nagios/.ssh/id_rsa.pub /home/nagios/.ssh/authorized_keys
chown nagios /home/nagios/.ssh/authorized_keys
chgrp other /home/nagios/.ssh/authorized_keys
The check_by_ssh Method
To use the check_by_ssh method, invoke the plugin with the remote host and the command to execute:
On Nagios01:
check_by_ssh -H <remote server name or IP> -C "<command to be executed>"
To test that check_by_ssh is working correctly, run the following on Nagios01:
/usr/lib/nagios/plugins/check_by_ssh -H nfs11 -C "/usr/cluster/bin/scstat -n | grep nfs11"
At this point we do not have the Nagios plugins installed on the Solaris server, but any command you can run at its command line can now be run through Nagios01. In the next series of steps, we will create Bash scripts on the Solaris side that perform checks and return values to the Nagios server.
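What Nagios actually consumes from such a script is a single line of output plus an exit status: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. A minimal sketch of that convention, using a hypothetical threshold check (not one of the plugins installed later):

```shell
# check_threshold: compare a value against warning/critical thresholds
# and report in the Nagios plugin convention.
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_threshold() {
    value=$1; warn=$2; crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value (threshold $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value (threshold $warn)"
        return 1
    fi
    echo "OK - value is $value"
    return 0
}

check_threshold 5 10 20    # prints "OK - value is 5"
```

Any script following this pattern can be dropped on the Solaris server and invoked through check_by_ssh just like the stock plugins.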
Custom Solaris Plugins
I have put some Solaris-specific plugins into Nagios01:/usr/lib/nagios/plugins/. The easiest way to use them is to create a Nagios plugin directory on the remote server and copy the plugins and their support files into it:
On Remote Solaris Server:
mkdir -p /usr/local/nagios/libexec
chown -R nagios /usr/local/nagios/libexec
On Nagios01:
scp /usr/lib/nagios/plugins/* <server name or IP>:/usr/local/nagios/libexec
NOTE: For reasons unknown, the Solaris-specific plugins have the wrong path hard-coded for the Nagios plugin directory. You will know which line to modify when you receive a "Can't locate utils.pm in @INC" error. Open the file with vi and jump to the line the error complains about (type the line number, then SHIFT-G); that line, or the one just above it, sets the Nagios plugin path to /lib/nagios/plugins, which is incorrect. Edit it to read /usr/local/nagios/libexec/.
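Rather than fixing each plugin by hand, you can patch them all at once on the remote Solaris server. This sketch assumes the bad string is exactly "/lib/nagios/plugins" (the path the utils.pm error points at) and uses a temp file because Solaris sed has no -i option:

```shell
# Rewrite the hard-coded plugin path in every copied Perl plugin.
# PLUGIN_DIR is the directory created in the previous step.
PLUGIN_DIR=${PLUGIN_DIR:-/usr/local/nagios/libexec}
for f in "$PLUGIN_DIR"/*.pl; do
    [ -f "$f" ] || continue     # skip if the glob matched nothing
    sed 's|"/lib/nagios/plugins"|"/usr/local/nagios/libexec/"|g' "$f" > "$f.tmp" \
        && mv "$f.tmp" "$f"
done
```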
Backup Nagios Files
Back up the existing Nagios config files on Nagios01:
cd /etc/nagios
cp hosts.cfg hosts.cfg.BAK
cp hostgroups.cfg hostgroups.cfg.BAK
cp services.cfg services.cfg.BAK
cp checkcommands.cfg checkcommands.cfg.BAK
Edit the checkcommands.cfg file
Add any new commands (for Solaris specific items) into that file in this way (for example):
# 'check_zfs_status' command definition
define command {
        command_name    check_zfs_status
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_zfs.pl -p -f"
        }
Do this for any additional commands necessary.
Edit the hosts.cfg file
Add this to the end of the file:
define host{
        use                     generic-host            ; Name of host template to use
        host_name               <host name>
        alias                   <Detailed host name>
        address                 <IP address for FE>
        parents                 rsacto
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        contact                 sysadmin-box
        }
Edit the hostgroups.cfg file
Add this to the end of the file:
define hostgroup{
        hostgroup_name  solaris-servers
        alias           NFS Cluster Servers
        members         <server1's name>,<server2's name>,etc
        }
Edit the services.cfg file
UNDER "Generic Services Definition Templates" add:
define service{
        use                     generic-zfs
        name                    generic-zfs
        service_description     ZFS
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          sysadmin-box
        notification_interval   10
        notification_period     24x7
        notification_options    w,c
        check_command           check_zfs_status
        register                0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
UNDER "BEGIN EDITING HERE" add:
define service{
        use             generic-zfs
        hostgroup_name  solaris-servers
        }
What to Monitor
Physical Nodes
I have used two custom Solaris plugins to monitor the ZFS filesystems and the SUN Cluster; each is configured as follows:
check_zfs.pl
check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_zfs.pl -p -f"
Where:
-p = check zpool. Default is not to check.
-f = check zfs file systems. Default is not to check.
check_scstat.pl
check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_scstat.pl -c 1,1,1 -w 1,1,1"
Where:
-c = Critical Number Of Outages (HOSTS / PATHS / IPMP)
-w = Warning Number of Outages (HOSTS / PATHS / IPMP)
I have also used the following stock Nagios plugins:
generic-disk-space
generic-ping
generic-load
The Cluster
Now for the resources delivered by the SUN Cluster software. Here are all of the networks and VIPs that we are using for this specific cluster:
# cat /etc/hosts
127.0.0.1       localhost loghost
192.168.33.71   nfs11 nfs11.internal.cisdata.net        # Cluster Node
192.168.33.72   nfs12 nfs12.internal.cisdata.net        # Cluster Node
192.168.44.225  nfs12-v44.internal.cisdata.net
192.168.44.224  nfs11-v44.internal.cisdata.net
192.168.44.55   nfs-fe-44 nfs-fe-44.internal.cisdata.net
192.168.55.17   nfs11-v55.internal.cisdata.net
192.168.55.18   nfs12-v55.internal.cisdata.net
192.168.55.56   nfs-fe-55 nfs-fe-55.internal.cisdata.net
192.168.99.150  nfs-fe-99 nfs-fe-99.internal.cisdata.net
192.168.99.105  nfs11-v99.internal.cisdata.net
192.168.99.106  nfs12-v99.internal.cisdata.net
We are going to create a Nagios host for each of the "nfs-fe-XX" addresses, then check these for the NFS service. I will also run a custom command to determine which host currently has control of the cluster.
CHECKCOMMANDS.CFG
Add this in the Solaris area:
# 'check_sc_host' command definition
define command {
        command_name    check_sc_host
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "hostname"
        }
HOSTS.CFG
Add this to the end of the file:
define host{
        use                     generic-host
        host_name               nfs-fe-99
        alias                   nfs-vlan99-vip
        address                 192.168.99.150
        parents                 rsacto
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        contact                 sysadmin-box
        }
Add an entry for nfs-fe-44 and nfs-fe-55 also.
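Following the same template, the nfs-fe-44 entry would look like this (address taken from the /etc/hosts table above; nfs-fe-55 is analogous, with address 192.168.55.56):

```
define host{
        use                     generic-host
        host_name               nfs-fe-44
        alias                   nfs-vlan44-vip
        address                 192.168.44.55
        parents                 rsacto
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        contact                 sysadmin-box
        }
```

The alias here is a guess following the nfs-vlan99-vip naming pattern; use whatever naming your site prefers.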
HOSTGROUPS.CFG
Add this entry to the end of the file:
define hostgroup{
        hostgroup_name  solaris-cluster
        alias           SUN Cluster
        members         nfs-fe-99,nfs-fe-55,nfs-fe-44
        }
SERVICES.CFG
Add this into the appropriate area:
define service{
        use                     generic-sc-host
        name                    generic-sc-host
        service_description     Cluster-Host
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          sysadmin-box
        notification_interval   10
        notification_period     24x7
        notification_options    w,c
        check_command           check_sc_host
        register                0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
Then add:
define service{
        use             generic-sc-host
        hostgroup_name  solaris-cluster
        }
Also add the "solaris-cluster" hostgroup to the following services:
generic-ping
generic-nfs
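Assuming those services are applied to hosts through hostgroup_name lines like the ZFS example above, this edit is just appending the new group to the list; for example:

```
define service{
        use             generic-ping
        hostgroup_name  solaris-servers,solaris-cluster
        }
```

If your existing definitions attach services per-host instead, add the three nfs-fe-XX hosts there accordingly.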
--Scotts 14:48, 29 October 2009 (UTC)