Possible false negative: Broken disks have no node

Depending on your configuration, a disk check can return a false negative for broken disks. Please read the following carefully.

We recently received an interesting bug report: a broken disk did not raise an alarm in the container-type check of the check_netapp_disk plugin. This false negative was caused by a configuration in which disks are checked per node:

check_netapp_disk container-type -H filer … --include=~^NODE-B\.

This checks all disks on NODE-B. A broken disk, however, may no longer be related to any node. Its name then has no node prefix, so it does not match the pattern and won't show up in the results.

The plugin's output is technically correct, but it is clearly not what the user intended when creating this check.

Note on the pattern in --include: the escaped dot at the end is important, because it ensures that a node such as `NODE-B0` is not accidentally included.
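The matching behaviour described above can be reproduced with a minimal Python sketch. It assumes that the part after `~` is treated as a regular expression matched against the disk's instance name, and it uses Python's `re` module as a stand-in for whatever regex engine the plugin uses. The disk names are hypothetical and only modelled on the sample output in this post.

```python
import re

# Hypothetical instance names; ".1.1.7" stands for a broken disk without a node.
disks = ["NODE-A.1.1.4", "NODE-B.1.1.3", "NODE-B0.1.1.2", ".1.1.7"]

# Pattern from the check above, and the same pattern without the trailing dot.
with_dot = re.compile(r"^NODE-B\.")
without_dot = re.compile(r"^NODE-B")

selected = [d for d in disks if with_dot.search(d)]
too_broad = [d for d in disks if without_dot.search(d)]

print(selected)   # disks the per-node check would look at
print(too_broad)  # what would be matched if the dot were omitted
```

Under these assumptions the nodeless disk appears in neither list, and dropping the trailing dot additionally pulls in the disk on `NODE-B0`.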

Possible solutions

You could reverse the logic by excluding the other node. For example, in a two-node cluster, --exclude=~^NODE-A\. results in:

NETAPP DISK CONTAINER TYPE CRITICAL - 24 disks checked, 2 CRITICAL
24 instances skipped because of --include/--exclude settings
NODE-B.1.1.3: broken (CRITICAL)
.1.1.7: broken (CRITICAL)
NODE-B.1.1.19: shared
NODE-B.1.1.21: shared
NODE-B.1.1.16: shared
…

Alternatively, you can configure a dedicated check for nodeless disks using the parameter --include=~^\.\d. In this example the result would be:

NETAPP DISK CONTAINER TYPE CRITICAL - 1 disk checked, 1 CRITICAL
47 instances skipped because of --include/--exclude settings
.1.1.7: broken (CRITICAL)
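To compare the two approaches, here is a small Python sketch of include/exclude filtering. The function `apply_filters` and the disk names are hypothetical and only illustrate the regex logic; this is not the plugin's actual implementation.

```python
import re

# Hypothetical instance names for a two-node cluster plus one nodeless disk.
disks = ["NODE-A.1.1.4", "NODE-B.1.1.3", ".1.1.7", "NODE-B.1.1.19"]

def apply_filters(names, include=None, exclude=None):
    """Keep names that match include (if given) and do not match exclude (if given)."""
    kept = []
    for name in names:
        if include and not re.search(include, name):
            continue
        if exclude and re.search(exclude, name):
            continue
        kept.append(name)
    return kept

# Solution 1: exclude the other node; nodeless disks stay in the result.
by_exclude = apply_filters(disks, exclude=r"^NODE-A\.")

# Solution 2: a dedicated check that only selects nodeless disks.
nodeless = apply_filters(disks, include=r"^\.\d")

print(by_exclude)
print(nodeless)
```

In this sketch the exclude variant keeps both the NODE-B disks and the nodeless disk, while the dedicated pattern selects the nodeless disk alone. The exclude variant only scales to two nodes; on larger clusters the dedicated check is probably the simpler option.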
