All of our servers at work are running Ubuntu 9.10 Server.
I was tasked with adding some monitoring tools for our core network, and was given Nagios 3 installed from the Ubuntu repository.
The first thing that threw me off was that all of the documentation you will find for Nagios is based on a binary install. A lot of the example files, scripts, etc. do not exist with the repo install.
After working out the bugs with the install, and an Apache2 hack for multiple SSL sites, I started digging into the configuration and writing definitions.
Nagios is a charm. The general concept and design took a little while to grasp, but basically it is as follows:
1. Define a host
2. Define a host group
3. Define a host template
4. Define a service
5. Define a service template
——————————————————-
1. Defining a host. Each line is a directive followed by its value, with an optional comment after the semicolon. This is a real host based on our configuration.
define host{
use core-router ; name of the host template used
host_name rtr1.wtc ; host name nagios will use for this host
hostgroups wtc-core, core-routers ; all host groups this host is a member of
alias WTC Core Router ; human readable name to this host
address rtr1.wtc ; the network host name of this host or ip address
}
2. Defining a group. Groups are wonderful. Your Nagios web page can look however you want it to. I used groups for different service levels, device types, and locations.
An example group definition:
define hostgroup{
hostgroup_name wtc-core
alias WTC Core
}
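The host defined earlier also lists core-routers in its hostgroups directive, so that group needs a definition of its own. A minimal sketch, assuming a group by device type; the alias text here is made up for illustration and is not from our configuration:

```cfg
# Hypothetical companion group for the "core-routers" name used by the host above.
define hostgroup{
    hostgroup_name  core-routers    ; must match the name listed in the host's hostgroups directive
    alias           Core Routers    ; assumed display name shown in the web interface
}
```

Because membership is declared on the host side with hostgroups, the group itself does not need a members line.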
3. Defining a host template. This is where the first step of the magic happens. (I know, it's not really magic.) The host template defines the keep-alive check used for the host, generic attributes, contacts, and notifications.
##Template for core routers
define host{
name core-router ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
check_command check-host-alive
max_check_attempts 10
notification_interval 0
notification_period 24x7
notification_options d,u,r
contact_groups network
register 0 ; DONT REGISTER THIS DEFINITION – ITS NOT A REAL HOST, JUST A TEMPLATE!
}
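The template above points at a contact group called network, which has to exist before Nagios will accept the configuration. A rough sketch of what that could look like follows. The contact name, alias, and email address are placeholders, and it assumes the notify-host-by-email and notify-service-by-email commands from the upstream sample configuration are defined on your system:

```cfg
# Hypothetical contact group matching "contact_groups network" in the template.
define contactgroup{
    contactgroup_name   network
    alias               Network Team    ; assumed display name
    members             netadmin        ; placeholder contact defined below
}

# Placeholder contact; every value here is an assumption for illustration.
define contact{
    contact_name                    netadmin
    alias                           Network Administrator
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email      ; assumed to be defined already
    service_notification_commands   notify-service-by-email   ; assumed to be defined already
    email                           netadmin@example.com
}
```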
4. Defining a service. A service definition ties a check command to a host or host group; the command is what actually tests whether the service is available.
NOTE FOR UBUNTU USERS: The default commands are NOT defined in the configuration when you install Nagios 3 from the repository using apt-get! The only working command out of the box is check_ping. You must define the commands in the /path/commands.cfg file.
For example:
###This command checks interface status (UP/DOWN) for Dell 3424 switches (the OID is in the service)
define command{
command_name check_snmp
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -P 2c -L noAuthNoPriv -C ********** $ARG1$
}
The $USER1$ macro refers to the plugin directory, and $ARG1$ is the first argument passed in from the service's check_command line (the text after the !).
A service:
## The only needed variable below is -o <OID> -r <return status, 1 is OK>
define service{
hostgroup_name core-switches
service_description Dell 3424 Switch Port 1 Status
check_command check_snmp! -o IF-MIB::ifOperStatus.12 -r 1
use level-0-service
notification_interval 0
}
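To make the macro substitution concrete, here is a sketch of how the pieces fit together. The plugin path is the usual location for the Ubuntu nagios-plugins packages but is an assumption here, and sw1.wtc is a hypothetical member of core-switches:

```cfg
# resource.cfg: $USER1$ is set once and reused by every command definition.
# Assumed path; check where your plugins are actually installed.
$USER1$=/usr/lib/nagios/plugins

# With the command and service above, for a hypothetical host whose
# address is sw1.wtc, the macros are filled in like this:
#   $HOSTADDRESS$ -> sw1.wtc
#   $ARG1$        -> -o IF-MIB::ifOperStatus.12 -r 1
#
# so the command line Nagios runs is roughly:
#   /usr/lib/nagios/plugins/check_snmp -H sw1.wtc -P 2c -L noAuthNoPriv -C <community> -o IF-MIB::ifOperStatus.12 -r 1
```

Running that expanded command by hand from a shell is a quick way to test a new check before handing it to Nagios.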
Another service:
#The following service checks for latency in IP connection.
define service{
use level-0-service ; this is the template which has notification/testing properties
hostgroup_name core-routers,core-switches, servers,core-links,***output omitted
service_description ping latency test
check_command check_ping!40.0,20%!80.0,60% ; Warning at 40 ms round trip or 20% packet loss, critical at 80 ms or 60% loss
}
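The two values separated by ! in that check_command become $ARG1$ and $ARG2$ inside the command definition. As an illustration only, a command that accepts them could look like the sketch below. The name check_ping_example is hypothetical, chosen so it does not collide with the check_ping that already ships with the package, whose exact definition may differ:

```cfg
# Illustrative only: shows how the two "!" arguments map onto plugin options.
define command{
    command_name    check_ping_example
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
}
# With check_ping_example!40.0,20%!80.0,60% the plugin would receive:
#   -w 40.0,20%   (warning threshold: round trip average in ms, packet loss)
#   -c 80.0,60%   (critical threshold: round trip average in ms, packet loss)
```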
5. A Service template:
define service{
name level-0-service ; The ‘name’ of this service template
service_description used for all core networking equipment, highest level of priority
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service ‘freshness’
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_interval 0 ; Only send notifications on status change by default.
is_volatile 0
check_period 24x7
normal_check_interval 3
retry_check_interval 1
max_check_attempts 4
notification_period 24x7
notification_options w,u,c,r ; w=warning, u=unknown, c=critical, r=recovers, f=flapping, s=scheduled downtime
contact_groups network
register 0 ; DONT REGISTER THIS DEFINITION – ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
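Templates can also inherit from other templates, which is handy when lower service levels differ from level 0 in only a few settings. A minimal sketch, assuming a hypothetical level-1-service template; the name and the overridden values are examples, not our production settings:

```cfg
# Hypothetical second-tier template built on top of level-0-service.
define service{
    name                    level-1-service     ; assumed template name
    use                     level-0-service     ; inherit everything from the level 0 template
    normal_check_interval   5                   ; example override: check less often than level 0
    notification_options    c,r                 ; example override: notify on critical and recovery only
    register                0                   ; still a template, not a real service
}
```

Any directive not listed here comes from level-0-service, so a change to the parent template carries through to both levels.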
———————————————-
One topic I did not cover, but plan to, is notifications. The network team receives email notifications for warnings and critical failures on core and level 1 network equipment, and the support staff gets an auto-generated NetSuite case when a lower-priority issue arises. I set up Postfix and some other things that I will discuss later.
I plan on learning more and more about Nagios over the next few years, as I will be developing its services and using it more each day.
I had no intention of writing such a long entry, but once you get started it is hard to stop.