rm_ack – an innovative switch with an additional touch

The abbreviation rm_ack stands for "remove service acknowledgement" and refers to a fairly simple algorithm for detecting masked alarms. Masked alarms occur only with checks that cover multiple instances, such as a single check of all volumes on a NetApp device. In a nutshell: a service check in Nagios turns critical because of one specific volume, and someone acknowledges it (Service Acknowledgement). From then on, Nagios sends no further alert messages for that service. In the meantime, a second, possibly more important volume may reach a critical state. Because the overall check is already critical and acknowledged, its state does not change, and the people who should be notified receive no alert that another volume is about to fail. In other words, the first failure and its acknowledgement mask all subsequent failures, hence the name masked alarms.
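The masking effect can be reproduced with a very small model. The sketch below is an illustration only: the volume names and both helper functions are hypothetical, and the notification rule is a deliberate simplification of how Nagios handles acknowledged problems.

```python
# Nagios plugin return codes
OK, WARNING, CRITICAL = 0, 1, 2

def overall_state(volume_states):
    """The overall check reports the worst state of any single volume."""
    return max(volume_states.values(), default=OK)

def notification_sent(state, acknowledged):
    """Simplified model: no alerts are sent for an acknowledged problem."""
    return state != OK and not acknowledged

# Check 1: vol_a fails, the overall check turns CRITICAL and alerts.
volumes = {"vol_a": CRITICAL, "vol_b": OK}
first = notification_sent(overall_state(volumes), acknowledged=False)

# An admin acknowledges the problem. Check 2: vol_b fails as well,
# but the overall state is still CRITICAL, so the acknowledgement stays.
volumes["vol_b"] = CRITICAL
second = notification_sent(overall_state(volumes), acknowledged=True)

print(first, second)  # True False
```

The second failure produces no alert because the overall state never changes, which is exactly the situation rm_ack is meant to catch.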

The algorithm activated by --rm_ack compares the current results with those of the previous check run and decides whether the cause of the problem has changed. If it detects a deterioration in one of the instances, the plugin does the following:

  • It notes in the Nagios output (GUI and alert) that the cause of the problem has changed.
  • It resets the Service Acknowledgement in Nagios so that Nagios sends alert messages again.
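The two steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the plugin's actual implementation: the function names, the host and service names, and the way previous results are passed in are all hypothetical. The only Nagios-specific part is the standard external command REMOVE_SVC_ACKNOWLEDGEMENT; whether the plugin uses that command or another mechanism is not stated here.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2

def worsened_instances(previous, current):
    """Instances whose state is worse than in the last check.

    Instances that were not seen before are compared against OK.
    """
    return sorted(name for name, state in current.items()
                  if state > previous.get(name, OK))

def remove_ack_command(host, service, now=None):
    """Format the Nagios external command that clears a service acknowledgement."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] REMOVE_SVC_ACKNOWLEDGEMENT;{host};{service}\n"

def rm_ack_step(previous, current, host, service, command_file=None, now=None):
    """Return (note, command); both are empty/None if nothing got worse."""
    changed = worsened_instances(previous, current)
    if not changed:
        return "", None
    # Step 1: a note for the plugin output (GUI and alert).
    note = "cause changed: " + ", ".join(changed)
    # Step 2: ask Nagios to drop the acknowledgement so alerts resume.
    command = remove_ack_command(host, service, now)
    if command_file is not None:
        # Assumption: command_file is the path of the Nagios command pipe.
        with open(command_file, "w") as pipe:
            pipe.write(command)
    return note, command

# Hypothetical example: vol_b fails while vol_a is already acknowledged.
previous = {"vol_a": CRITICAL, "vol_b": OK}
current = {"vol_a": CRITICAL, "vol_b": CRITICAL}
note, command = rm_ack_step(previous, current, "netapp01", "volumes", now=0)
print(note)             # cause changed: vol_b
print(command, end="")  # [0] REMOVE_SVC_ACKNOWLEDGEMENT;netapp01;volumes
```

Comparing per-instance states rather than the overall state is the key design choice: the overall state stays critical throughout, so only the per-instance comparison can reveal that a different volume is now failing.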

The following table shows the course of events for a check with multiple instances (overall check) in which failures or errors are no longer reported because a previous failure was acknowledged (ACK).

The RM_ACK function resets the acknowledgements in Nagios. Resets are represented by green arrows in the table.

[Table image: alarm masking (alarmmaskierung_en)]

