How to deal with changing causes

One of the most innovative features of check_netapp_pro is probably the option --rm_ack. It addresses a weakness of acknowledged overall checks: once such a check has been acknowledged, new errors no longer trigger an active alarm and can therefore be easily overlooked. This switch might soon be replaced by another, more sophisticated approach.

Changing causes and why they are a problem

When running overall checks, i.e. checks that verify multiple instances at once, such as volumes, a volume that is not particularly important can easily trigger an alarm for being “almost full”. The alarm is then acknowledged in the monitoring system, so that no further alarms are sent. In order to resolve the issue, you probably have to order additional disks. In the meantime, another, more important volume covered by the same check may also reach its full capacity. Since the service check is already in a CRITICAL state and has been acknowledged, the monitoring system cannot detect that the reason for the CRITICAL status has changed. Such changes of reason cause problems whenever service acknowledgements are used and multiple instances are checked at once. In a storage system with a large number of volumes or disks this is almost unavoidable.

This figure shows how changes of cause can occur and how the rm_ack logic reacts accordingly.

Status quo

Let’s take a look at how the current implementation of --rm_ack works. As our help page explains:

--rm_ack

    Under what circumstances a service-acknowledgement (problem acknowledgement for a particular service) should be removed. Can be either 'always', 'never', or 'reason-change'. Defaults to 'reason-change'. The value 'reason-change' means, remove a service-acknowledgement only if the reason for the non-ok state has changed (something got worse or an additional instance got into a non-ok state).
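The 'reason-change' rule from the help text can be illustrated with a small sketch. It is not the plugin's actual code: the function name, the state ranking, and the idea of comparing two dictionaries of instance states (one from the previous check, one from the current check) are assumptions made for illustration. UNKNOWN states are left out for brevity.

```python
# Hypothetical sketch of the 'reason-change' rule; not taken from check_netapp_pro.
# Severity ranking of the states used here; higher means worse.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def count_reason_changes(previous, current):
    """Count instances that got worse or newly entered a non-OK state.

    Both arguments map an instance name (e.g. a volume) to its state.
    Instances missing from `previous` are treated as having been OK.
    Improvements (e.g. CRITICAL -> WARNING) are not counted.
    """
    changes = 0
    for name, state in current.items():
        old_state = previous.get(name, "OK")
        if SEVERITY[state] > SEVERITY[old_state]:
            changes += 1
    return changes


# vol5 was already CRITICAL and acknowledged; vol0 turns CRITICAL afterwards.
previous = {"vol5": "CRITICAL", "vol0": "OK"}
current = {"vol5": "CRITICAL", "vol0": "CRITICAL"}
print(count_reason_changes(previous, current))  # 1
```

A comparison like this requires the plugin to remember the result of the previous check, for example in a small state file.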

By default, --rm_ack is set to reason-change. This tells the plugin to remove the acknowledgement of the service check that called it whenever the cause changes. To do so, the plugin needs to know which service check called it and for which host (both can be configured in the cfg file of the monitoring system). When we developed this technique, this information could be retrieved from environment variables set by the Nagios daemon.

However, these variables are not always present in newer versions, since current daemons only export them if explicitly told to do so. Furthermore, other plugins are equally affected by this problem and don’t offer a viable solution either. In some environments, service acknowledgements are therefore not used at all and their functionality is handled at a higher level, e.g. a ticketing system.
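As an illustration, here is a minimal sketch of how a plugin could look up its calling host and service and cope with missing variables. Nagios exports its macros as environment variables with a NAGIOS_ prefix when environment macros are enabled (the enable_environment_macros setting in nagios.cfg). That check_netapp_pro reads exactly these two variables is an assumption, and the function name is hypothetical.

```python
import os


def calling_service(env=os.environ):
    """Return (host, service) of the calling check, or None if unknown.

    Assumption: the daemon exported the HOSTNAME and SERVICEDESC macros
    as NAGIOS_HOSTNAME and NAGIOS_SERVICEDESC.
    """
    host = env.get("NAGIOS_HOSTNAME")
    service = env.get("NAGIOS_SERVICEDESC")
    if not host or not service:
        # Variables were not exported: the plugin cannot tell which
        # acknowledgement it would have to remove.
        return None
    return host, service


print(calling_service({"NAGIOS_HOSTNAME": "filer", "NAGIOS_SERVICEDESC": "Usage"}))
print(calling_service({}))
```

Returning None instead of raising lets the caller fall back to a behaviour that does not depend on the daemon's configuration.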

Possible changes

As we have seen above, a lot has changed since we first implemented --rm_ack, and we think it’s time to reconsider how we handle changing causes. In addition, the plugin instructs the monitoring system to remove a service acknowledgement through a pipe, an approach that has always had some drawbacks.
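For readers unfamiliar with the pipe mechanism: Nagios accepts external commands written to its command file, and REMOVE_SVC_ACKNOWLEDGEMENT is the command that clears a service acknowledgement. The sketch below shows the general shape of such a call. The function names are hypothetical, the location of the command file depends on the installation, and this is not necessarily how check_netapp_pro implements it.

```python
import time


def remove_ack_command(host, service, now=None):
    """Format a Nagios external command that removes a service acknowledgement."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] REMOVE_SVC_ACKNOWLEDGEMENT;{host};{service}\n"


def send_command(command, command_file):
    """Write one command line to the daemon's command file (a named pipe).

    `command_file` is installation specific; the caller must supply it
    and needs write permission on the pipe.
    """
    with open(command_file, "w") as pipe:
        pipe.write(command)


print(remove_ack_command("filer", "Usage", now=1700000000), end="")
```

The sketch also hints at the drawbacks: the plugin must run on the same host as the daemon, must know the path of the command file, and must be allowed to write to it.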

A more flexible solution would be to remove the option --rm_ack and replace it with a new option --reason_change=ignore | signal | remove_ack.

This new option --reason_change would control the following logic:

  • ignore instructs the plugin not to react to changing causes.

  • signal would include an easily readable string in the message to the monitoring system (stdout, LONGSERVICEOUTPUT) to signal that a change of reason has occurred. Operators can spot this string in the GUI, and ticketing systems can use it to decide between creating a new ticket and updating an existing one.

    ./check_netapp_pro Usage -H filer -o volume --reason_change=signal

    NETAPP_USAGE CRITICAL - 102 volumes checked, 2 critical, 1 warning
    REASON-CHANGE: 2 reason-changes since last check.
    vol0 93% (CRITICAL)
    vol1 95% (CRITICAL)
    vol5 85% (WARNING)
    vol4 43%
    …

  • remove_ack extends signal and additionally instructs the monitoring system to remove any service acknowledgements that might have been set. This option corresponds to --rm_ack=reason-change, which, as we now know, is not the best solution.
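The three proposed values could be handled roughly as in the following sketch. Since the option is only a proposal, everything here is hypothetical: the function name, its arguments, and the decision to return the output lines together with a flag that tells the caller whether an acknowledgement should be removed. Only the wording of the REASON-CHANGE line is taken from the example output above.

```python
VALID_MODES = ("ignore", "signal", "remove_ack")


def apply_reason_change(mode, changes, summary):
    """Return (output_lines, remove_ack) for a proposed --reason_change mode.

    `changes` is the number of reason changes since the last check and
    `summary` is the first line of the plugin output.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reason_change mode: {mode!r}")
    if mode == "ignore" or changes == 0:
        # Nothing to report: plain output, acknowledgement untouched.
        return [summary], False
    marker = f"REASON-CHANGE: {changes} reason-changes since last check."
    # 'remove_ack' extends 'signal': same marker, plus a removal request.
    return [summary, marker], mode == "remove_ack"


lines, remove_ack = apply_reason_change(
    "signal", 2, "NETAPP_USAGE CRITICAL - 102 volumes checked, 2 critical, 1 warning"
)
print("\n".join(lines))
print(remove_ack)
```

Keeping the marker text fixed makes it easy for a ticketing system to match on it, whichever of the two non-ignore modes is active.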

