Improvements for the takeover-check

Our fairly new check_netapp_takeover has been in the field for several weeks, and we have found some issues with how it handles API errors and with how it presents the errors it finds.

Handling of API-Errors

Sometimes the API returns a message similar to “A metrocluster check operation is in progress. Wait for it to complete …”. In that case it may make sense to retry fetching the data after a few seconds. You can now do so by setting --retry_interval to a number of seconds lower than --timeout. In practice I would recommend setting --timeout to a fairly large value such as 120 seconds and --retry_interval to about 20 seconds. This would allow approximately 5 retries. The retry function is an experimental feature and may be removed if it turns out not to be useful in practice.
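To illustrate how --timeout and --retry_interval interact, here is a minimal Python sketch of a retry loop bounded by a deadline. It is not the plugin's actual code: `fetch_with_retry`, `TransientApiError` and the `fetch` callable are hypothetical names made up for this illustration, and the rule for giving up (stop when another wait would reach the deadline) is an assumption.

```python
import time


class TransientApiError(Exception):
    """Hypothetical error type for 'operation in progress' style API answers."""


def fetch_with_retry(fetch, timeout, retry_interval,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it succeeds or the timeout budget is used up.

    Returns (result, attempts). Raises TimeoutError when another wait of
    retry_interval seconds would reach or pass the deadline (assumed rule).
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return fetch(), attempts
        except TransientApiError as err:
            # Give up if there is no room left for one more wait.
            if clock() + retry_interval >= deadline:
                retries = attempts - 1
                raise TimeoutError(
                    f"{retries} retries failed - giving up, {err}"
                ) from err
            sleep(retry_interval)
```

With a timeout of 120 and a retry interval of 20, this sketch makes one initial attempt followed by up to 5 retries (at roughly 20, 40, 60, 80 and 100 seconds), which matches the "approximately 5 retries" estimate above.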

Presentation of the errors found

It happened that the check found an error while checking the MC aspects of the takeover capability. In that case an alarm (CRITICAL) was sent, but the reason for the alarm was hidden somewhere in the extended service output. This made the alarm look like a false positive. We have therefore completely changed the presentation and made this check behave like the many overall-checks we have, with a summary line at the beginning of the output. Let's see some examples.

Examples

NETAPP_TAKEOVER  OK - 7 takeover-aspects checked
mc-nodes: ok
mc-lifs: ok
mc-config_replication: ok
mc-aggregates: ok
mc-clusters: ok
node qhfn0101: connected (ha mode) The the storage failover facility is enabled. Takeover of partner is possible. Takeover by partner is possible.
node qhxn102: connected (ha mode) The the storage failover facility is enabled. Takeover of partner is possible. Takeover by partner is possible.

NETAPP_TAKEOVER  CRITICAL - 6 takeover-aspects checked, 1 critical and 0 warning
mc-aggregates: warning   (CRITICAL)
mc-nodes: ok
mc-lifs: ok
mc-config_replication: ok
mc-clusters: ok
node rbfx801: in_non_HA_mode (non_ha mode) nothing checked


And here is one from a situation where the API returns an error:

NETAPP_TAKEOVER  UNKNOWN - 5 retries to get MC-data failed - giving up, metrocluster-check-get-iter failed: A metrocluster check operation is in progress. Wait for it to complete and retry this command. […]
node SFx7-01: connected (ha mode) The the storage failover facility is enabled. Takeover of partner is possible. Takeover by partner is possible.
node SFx7-02: connected (ha mode) The the storage failover facility is enabled. Takeover of partner is possible. Takeover by partner is possible.
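
The summary-line layout seen in the OK and CRITICAL examples above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's implementation: the function `summarize`, the tuple layout of an aspect, and the rule of listing the worst aspects first are assumptions derived from the sample output.

```python
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def summarize(aspects):
    """Build (overall_state, output) from a list of (name, detail, state).

    The first output line is the summary; the per-aspect lines follow,
    worst state first (assumed from the sample output).
    """
    criticals = sum(1 for _, _, state in aspects if state == CRITICAL)
    warnings = sum(1 for _, _, state in aspects if state == WARNING)
    overall = CRITICAL if criticals else WARNING if warnings else OK

    summary = (f"NETAPP_TAKEOVER  {STATE_NAMES[overall]} - "
               f"{len(aspects)} takeover-aspects checked")
    if overall != OK:
        summary += f", {criticals} critical and {warnings} warning"

    # sorted() is stable, so aspects with equal state keep their input order.
    ordered = sorted(aspects, key=lambda aspect: -aspect[2])
    lines = [summary]
    for name, detail, state in ordered:
        suffix = f" ({STATE_NAMES[state]})" if state != OK else ""
        lines.append(f"{name}: {detail}{suffix}")
    return overall, "\n".join(lines)
```

The point of this design is that the reason for an alarm appears directly below the summary line instead of being buried further down in the extended output.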


Availability
------------

This change will be part of the next unstable test-release but has **not** been published yet.
