Monitoring Solaris with Nagios

There are some caveats to monitoring Solaris with Nagios. The typical approach is to use the Nagios Remote Plugin Executor (NRPE): the Nagios host asks an NRPE daemon on the monitored server to run the plugin there. Installing NRPE on NFS11 was difficult, and the current version has issues with the SSL library installed on a Solaris 10 system. I chose instead to use the check_by_ssh plugin on the Nagios server, which logs into the remote server via SSH and runs a specific command.

Setup Nagios User

The first step in monitoring any machine remotely is to set up a Nagios user:
On Remote Solaris Server:

useradd -d /home/nagios -m -s /bin/bash nagios
passwd nagios

Share SSH Keys Between Nagios and Remote Server

The next step is to switch to the nagios user on Nagios01 and SSH into the remote host:
On Nagios01:

su -l nagios
ssh <remote host name or IP>
(accept the host key)

On Remote Solaris Server:

mkdir /home/nagios/.ssh
scp nagios01:/home/nagios/.ssh/id_rsa.pub /home/nagios/.ssh/authorized_keys
chown nagios /home/nagios/.ssh/authorized_keys
chgrp other /home/nagios/.ssh/authorized_keys
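
The steps above assume that the nagios user on Nagios01 already has a key pair in /home/nagios/.ssh, and that the permissions on the remote .ssh directory are tight enough for sshd to accept the key (its StrictModes setting can reject keys whose files are writable by others). A minimal sketch covering both points follows. The function names ensure_keypair and tighten_ssh_perms are hypothetical, and the empty passphrase is an assumption made so that check_by_ssh can log in without a prompt.

```shell
# Hypothetical helper (run as nagios on Nagios01): create an RSA key pair
# if none exists yet. Assumes OpenSSH's ssh-keygen is installed. The empty
# passphrase (-N "") is an assumption so that logins are non-interactive.
ensure_keypair() {
    key="${1:-$HOME/.ssh/id_rsa}"
    if [ ! -f "$key" ]; then
        mkdir -p "$(dirname "$key")"
        ssh-keygen -t rsa -N "" -f "$key"
    fi
}

# Hypothetical helper (run on the remote Solaris server): restrict the
# .ssh directory and authorized_keys file under the given home directory
# to their owner only.
tighten_ssh_perms() {
    home_dir="$1"
    chmod 700 "$home_dir/.ssh"
    chmod 600 "$home_dir/.ssh/authorized_keys"
}

# Example:
#   ensure_keypair                      # on Nagios01, as nagios
#   tighten_ssh_perms /home/nagios      # on the remote Solaris server
```

After this, running ssh to the remote host again as nagios on Nagios01 should log in without asking for a password.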

The check_by_ssh Method

To use the check_by_ssh method, use the following syntax:
On Nagios01:

check_by_ssh -H <remote server name or IP> -C "<command to be executed>"

To test whether check_by_ssh is working correctly, run the following on Nagios01:

/usr/lib/nagios/plugins/check_by_ssh -H nfs11 -C "/usr/cluster/bin/scstat -n | grep nfs11"
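
When testing by hand, the exit code is as informative as the printed output, because Nagios reads plugin results from it: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. As far as I know, check_by_ssh passes the remote command's exit code through, so a grep that finds nothing (exit 1) would show up as WARNING. A small sketch for translating the code, where nagios_state_name is a hypothetical helper name:

```shell
# Map a Nagios plugin exit code to its state name.
# Anything outside 0-2 is reported as UNKNOWN.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example, run right after the check_by_ssh test command:
#   nagios_state_name $?
```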

At this point the Nagios plugins are not installed on the Solaris server, but any command you can run on its command line can now be run from Nagios01. In the next series of steps, we will be creating Bash scripts to run on Solaris to perform some actions and return some values back to the Nagios server.
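
A script run this way only has to print one line of status text and exit with a Nagios status code. A minimal sketch of the core of such a script follows, assuming a hypothetical check_value function and thresholds where a higher reading is worse. It is an illustration only and is not taken from the plugins used later on this page.

```shell
# Hypothetical plugin core: compare an integer reading against warning and
# critical thresholds (higher is worse), print one status line, and return
# the matching Nagios exit code (0 OK, 1 WARNING, 2 CRITICAL).
check_value() {
    label="$1"; value="$2"; warn="$3"; crit="$4"
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value (critical at $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value (warning at $warn)"
        return 1
    fi
    echo "OK - $label is $value"
    return 0
}

# Example: warn at 200 processes, go critical at 400.
#   check_value "process count" "$(ps -e | wc -l)" 200 400
```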

Custom Solaris Plugins

I have put some Solaris-specific plugins into Nagios01:/usr/lib/nagios/plugins/. The easiest way to use them is to create a Nagios plugin directory on the remote server and copy the plugins and their support files into it:
On Remote Solaris Server:

mkdir -p /usr/local/nagios/libexec
chown -R nagios /usr/local/nagios/libexec

On Nagios01:

scp /usr/lib/nagios/plugins/* <server name or IP>:/usr/local/nagios/libexec

NOTE: For reasons unknown, the Solaris-specific plugins have the wrong file path for the Nagios plugins directory. You will know which line to modify when you receive a "Can't locate utils.pm in @INC" error. Open the file with vi, press ESC, type the line number that the error complains about, and press SHIFT-G to go to that line. Usually the line above it has a Nagios plugin path equal to /lib/nagios/plugins, which is incorrect. Edit the line to read /usr/local/nagios/libexec/.
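
If several plugins need the same correction, the edit can be scripted instead of repeated in vi. The sketch below assumes the path is set by a Perl use lib "..." line, which is the usual way Perl plugins locate utils.pm but is not confirmed for every plugin here. The function name fix_plugin_lib_path is hypothetical. It avoids sed -i, which the stock Solaris sed does not have.

```shell
# Hypothetical helper: point a Perl plugin's 'use lib "..."' line at the
# local plugin directory. Assumes the path is set by a 'use lib' line with
# a single quoted path; other lines are left untouched.
fix_plugin_lib_path() {
    file="$1"
    new_dir="${2:-/usr/local/nagios/libexec}"
    tmp_file="$file.tmp.$$"
    # Write to a temp file, then copy the contents back so that the
    # plugin keeps its owner and execute permission.
    sed "s|^\( *use lib *\)[\"'][^\"']*[\"']|\1\"$new_dir\"|" "$file" > "$tmp_file" \
        && cat "$tmp_file" > "$file"
    rc=$?
    rm -f "$tmp_file"
    return $rc
}

# Example:
#   fix_plugin_lib_path /usr/local/nagios/libexec/check_zfs.pl
```

Run the plugin by hand afterwards to confirm that the utils.pm error is gone.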

Back Up Nagios Files

Back up the existing Nagios config files on Nagios01:

cd /etc/nagios
cp hosts.cfg hosts.cfg.BAK
cp hostgroups.cfg hostgroups.cfg.BAK
cp services.cfg services.cfg.BAK 
cp checkcommands.cfg checkcommands.cfg.BAK
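
A fixed .BAK name is overwritten the next time the files are backed up. One possible variation, sketched here with a hypothetical backup_nagios_cfgs function, adds a timestamp to each copy so that earlier backups survive:

```shell
# Hypothetical helper: copy each named config file in a directory to
# <name>.BAK.<timestamp>, preserving mode and modification time.
# Stops and returns 1 at the first file that cannot be copied.
backup_nagios_cfgs() {
    cfg_dir="$1"; shift
    stamp=$(date +%Y%m%d%H%M%S)
    for f in "$@"; do
        cp -p "$cfg_dir/$f" "$cfg_dir/$f.BAK.$stamp" || return 1
    done
}

# Example:
#   backup_nagios_cfgs /etc/nagios hosts.cfg hostgroups.cfg services.cfg checkcommands.cfg
```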

Edit the checkcommands.cfg file

Add any new commands (for Solaris-specific items) to that file, for example:

# 'check_zfs_status' command definition
define command {
        command_name    check_zfs_status
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_zfs.pl -p -f"
}

Do this for any additional commands necessary.
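
Since every Solaris command definition has the same shape, the block can be generated rather than typed. A sketch, assuming a hypothetical make_check_command function that takes the command name and the remote command line and prints a definition in the format shown above:

```shell
# Hypothetical helper: print a check_by_ssh command definition.
#   $1 = Nagios command name
#   $2 = command line to run on the remote server
# $USER1$ and $HOSTADDRESS$ are Nagios macros and are printed literally.
make_check_command() {
    printf "# '%s' command definition\n" "$1"
    printf 'define command {\n'
    printf '        command_name    %s\n' "$1"
    printf '        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "%s"\n' "$2"
    printf '}\n'
}

# Example:
#   make_check_command check_scstat \
#       "/usr/local/nagios/libexec/check_scstat.pl -c 1,1,1 -w 1,1,1" >> checkcommands.cfg
```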

Edit the hosts.cfg file

Add this to the end of the file:

define host{
use generic-host ; Name of host template to use
host_name <host name>
alias <Detailed host name>
address <IP address for FE>
parents rsacto
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
contact_groups sysadmin-box
} 

Edit the hostgroups.cfg file

Add this to the end of the file:

define hostgroup{
hostgroup_name solaris-servers
alias NFS Cluster Servers
members <server1's name>,<server2's name>,etc
} 

Edit the services.cfg file

UNDER "Generic Services Definition Templates" add:

define service{
        name                            generic-zfs
        service_description             ZFS
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           3
        retry_check_interval            1
        contact_groups                  sysadmin-box
        notification_interval           10
        notification_period             24x7
        notification_options            w,c
        check_command                   check_zfs_status
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

UNDER "BEGIN EDITING HERE" add:

define service{
        use                             generic-zfs
        hostgroup_name                  solaris-servers
}
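
After editing these files, the configuration can be checked before Nagios is restarted. The sketch below uses -v, Nagios's verify option. It assumes the binary is on the PATH as nagios and that the main configuration file is /etc/nagios/nagios.cfg, neither of which is stated on this page. The function name verify_nagios_config is hypothetical.

```shell
# Run Nagios's configuration check (-v) and return its exit status.
# Assumptions: the binary is called "nagios" (override with NAGIOS_BIN)
# and the main config file is /etc/nagios/nagios.cfg (override with $1).
verify_nagios_config() {
    "${NAGIOS_BIN:-nagios}" -v "${1:-/etc/nagios/nagios.cfg}"
}

# Example:
#   verify_nagios_config && echo "config OK, safe to restart Nagios"
```

If the check reports errors, the .BAK copies made earlier can be restored.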

What to Monitor

Physical Nodes

I have used two custom Solaris plugins to monitor the ZFS filesystems and the SUN Cluster. Each is configured as follows:

check_zfs.pl

check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_zfs.pl -p -f"

Where:
-p = check zpool. Default is not to check.
-f = check zfs file systems. Default is not to check.

check_scstat.pl

check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_scstat.pl -c 1,1,1 -w 1,1,1"

Where:
-c = Critical Number Of Outages (HOSTS / PATHS / IPMP)
-w = Warning Number of Outages (HOSTS / PATHS / IPMP)
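
To make the triplet format concrete, here is one way such thresholds could be evaluated. This illustrates the HOSTS,PATHS,IPMP format only and is not code from check_scstat.pl. The at-or-above comparison and the eval_outages name are assumptions.

```shell
# Hypothetical illustration: compare three outage counts (HOSTS,PATHS,IPMP)
# against warning and critical triplets. Returns 2 if any count is at or
# above its critical value, 1 if any is at or above its warning value,
# and 0 otherwise.
eval_outages() {
    counts="$1"; warn="$2"; crit="$3"
    state=0
    for i in 1 2 3; do
        n=$(echo "$counts" | cut -d, -f"$i")
        w=$(echo "$warn" | cut -d, -f"$i")
        c=$(echo "$crit" | cut -d, -f"$i")
        if [ "$n" -ge "$c" ]; then
            state=2
        elif [ "$n" -ge "$w" ] && [ "$state" -lt 1 ]; then
            state=1
        fi
    done
    return "$state"
}

# Example: one failed path with -w 1,1,1 -c 1,1,1
#   eval_outages 0,1,0 1,1,1 1,1,1; echo $?
```

Under this reading, -c 1,1,1 -w 1,1,1 makes any single outage critical.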

I have also used the following stock Nagios plugins:
generic_disk_space
generic_ping
generic-load

The Cluster

Next come the resources delivered by the SUN Cluster software. Here are all of the networks and VIPs that we are using for this specific cluster:

# cat /etc/hosts 
127.0.0.1       localhost       loghost 
192.168.33.71   nfs11 nfs11.internal.cisdata.net # Cluster Node
192.168.33.72   nfs12 nfs12.internal.cisdata.net # Cluster Node
192.168.44.225  nfs12-v44.internal.cisdata.net
192.168.44.224  nfs11-v44.internal.cisdata.net
192.168.44.55   nfs-fe-44 nfs-fe-44.internal.cisdata.net
192.168.55.17   nfs11-v55.internal.cisdata.net
192.168.55.18   nfs12-v55.internal.cisdata.net
192.168.55.56   nfs-fe-55 nfs-fe-55.internal.cisdata.net
192.168.99.150  nfs-fe-99 nfs-fe-99.internal.cisdata.net
192.168.99.105  nfs11-v99.internal.cisdata.net
192.168.99.106  nfs12-v99.internal.cisdata.net

We are going to make a Nagios host for each of the "nfs-fe-XX" addresses, then check these for the NFS service. I will also run a custom command to determine which host has control of the cluster.

CHECKCOMMANDS.CFG
Add this in the Solaris area:

# 'check_sc_host' command definition
define command {
       command_name    check_sc_host
       command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "hostname"
}
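
Because the front-end addresses float between the nodes, running hostname over SSH against a VIP returns the name of whichever node currently holds it. The definition above only reports that name. A stricter variant is sketched below. It assumes a hypothetical check_cluster_owner script on the nodes and treats any answer other than nfs11 or nfs12 as a problem.

```shell
# Hypothetical stricter check: validate the node name that answered on the
# VIP and return a Nagios exit code (0 OK, 2 CRITICAL, 3 UNKNOWN).
check_cluster_owner() {
    owner="$1"
    case "$owner" in
        nfs11|nfs12)
            echo "OK - cluster resources are on $owner"
            return 0 ;;
        "")
            echo "UNKNOWN - no hostname returned"
            return 3 ;;
        *)
            echo "CRITICAL - unexpected node $owner"
            return 2 ;;
    esac
}

# Example, on a cluster node:
#   check_cluster_owner "$(hostname)"
```

If hostname returns a fully qualified name on these nodes, the patterns in the case statement would need to match that form instead.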

HOSTS.CFG
Add this to the end of the file:

define host{
use generic-host
host_name nfs-fe-99
alias nfs-vlan99-vip
address 192.168.99.150
parents rsacto
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
contact_groups sysadmin-box
} 

Add an entry for nfs-fe-44 and nfs-fe-55 also.
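
The three host entries differ only in name, alias and address, so they can be generated from one template. This is a sketch with a hypothetical make_vip_host function. The addresses come from the /etc/hosts listing above, while the aliases for VLANs 44 and 55 are assumed to follow the nfs-vlan99-vip pattern. The directive spellings max_check_attempts and contact_groups are the standard Nagios ones.

```shell
# Hypothetical helper: print a host definition for one front-end VIP.
#   $1 = host_name, $2 = alias, $3 = address
make_vip_host() {
    name="$1"; alias_name="$2"; address="$3"
    cat <<EOF
define host{
use generic-host
host_name $name
alias $alias_name
address $address
parents rsacto
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
contact_groups sysadmin-box
}
EOF
}

# Example (aliases are assumed names):
#   make_vip_host nfs-fe-44 nfs-vlan44-vip 192.168.44.55 >> hosts.cfg
#   make_vip_host nfs-fe-55 nfs-vlan55-vip 192.168.55.56 >> hosts.cfg
```
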
HOSTGROUPS.CFG
Add this entry to the end of the file:

define hostgroup{
        hostgroup_name  solaris-cluster
        alias           SUN Cluster        
        members         nfs-fe-99,nfs-fe-55,nfs-fe-44
}

SERVICES.CFG
Add this into the appropriate area:

define service{
        name                            generic-sc-host
        service_description             Cluster-Host
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           3
        retry_check_interval            1
        contact_groups                  sysadmin-box
        notification_interval           10
        notification_period             24x7
        notification_options            w,c
        check_command                   check_sc_host
        register                        0 	; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

Then add:

define service{
        use                             generic-sc-host
        hostgroup_name                  solaris-cluster
}

Also add the "solaris-cluster" hostgroup to the following services:
generic-ping
generic-nfs


--Scotts 14:48, 29 October 2009 (UTC)