Checking for runaway processes

One of our customers proposed a new check after a single process ate up all the CPU time on a filer. The culprit is easy to identify once you are on the filer's command line (priv-mode) by issuing the `ps` command. To automate this kind of monitoring and raise an alarm as soon as a process gets out of control, we now offer check_netapp_processes.

Example
-------

This is how one would check using the default thresholds:

```
$ ./check_netapp_process.pl -H filer -s filer-01 -u admin -p ******
NETAPP_PROCESS CRITICAL - 1064 processes checked, 2 critical and 0 warning
idle: cpu0: 102.0 (CRITICAL)
idle: cpu1: 102.0 (CRITICAL)
ontap_dead_bsd_thre: 1.0
worker_thread_38: 0.0
iswts_sockio: 0.0
[very long list deleted]
SMBOff […] | worker_thread_38=0.00%;20;50;0;100 iswts_sockio=0.00%;20;50;0;100 wafl_blog_early_kickout_worke=0.00%;20;50;0;100
[lots of perfdata deleted]
```
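The part of the output after the `|` is performance data in the usual Nagios plugin format (`label=value[unit];warn;crit;min;max`), so in the example `20` and `50` appear to be the warning and critical thresholds. As a rough illustration of how such a line can be post-processed, here is a minimal Python sketch. The names `PerfDatum`, `parse_perfdata` and `state` are made up for this example and are not part of the plugin; the sketch only handles plain numeric thresholds (no range syntax) and labels without spaces, and it assumes a value counts as exceeded when it is strictly greater than the threshold.

```python
import re
from typing import List, NamedTuple, Optional


class PerfDatum(NamedTuple):
    """One perfdata entry (hypothetical helper type, not from the plugin)."""
    label: str
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]


def _num(field: str) -> Optional[float]:
    """Return a plain numeric threshold, or None if empty or not a plain number."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_perfdata(plugin_output: str) -> List[PerfDatum]:
    """Parse everything after the first '|' into PerfDatum entries."""
    _, _, perf = plugin_output.partition("|")
    data = []
    for token in perf.split():
        label, eq, rest = token.partition("=")
        if not eq:
            continue  # not a label=value token
        fields = rest.split(";")
        match = re.match(r"(-?\d+(?:\.\d+)?)(.*)$", fields[0])
        if not match:
            continue  # value is not numeric
        warn = _num(fields[1]) if len(fields) > 1 else None
        crit = _num(fields[2]) if len(fields) > 2 else None
        data.append(PerfDatum(label, float(match.group(1)), match.group(2), warn, crit))
    return data


def state(datum: PerfDatum) -> str:
    """Classify a value; assumes 'greater than threshold' means exceeded."""
    if datum.crit is not None and datum.value > datum.crit:
        return "CRITICAL"
    if datum.warn is not None and datum.value > datum.warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    line = "NETAPP_PROCESS OK | worker_thread_38=0.00%;20;50;0;100 iswts_sockio=0.00%;20;50;0;100"
    for entry in parse_perfdata(line):
        print(entry.label, entry.value, state(entry))
```

Something like this can be handy for spotting the few interesting processes in a very long perfdata list before deciding which filters to set.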


Filtering the processes
-----------------------

Since the list is quite long, filters can be set by means of `--exclude` and `--include`. For example, if you do **not** want to check the _idle_ processes, you would configure the check like this:

```
$ ./check_netapp_process.pl -H filer ... -X ^idle:
```

For other tips on how to deal with the very long output of this check, have a look at the article [Overly Long Outputs](https://blog.monitoring-plugins.pro/posts/overly-long-outputs/).
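To get a feeling for what a pattern such as `^idle:` would match before putting it into the service definition, the filtering can be imitated offline. The following Python sketch is only an approximation: it assumes the filters are regular expressions searched against the process name, and that a process must match the include pattern (if given) and must not match the exclude pattern (if given). How the plugin itself combines the two options is not documented here, and `filter_processes` is a hypothetical helper, not part of the plugin.

```python
import re
from typing import Iterable, List, Optional


def filter_processes(
    names: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep names matching `include` (if set) and not matching `exclude` (if set).

    Assumption: both filters are regular expressions applied with search(),
    so an anchor like '^' is needed to match only at the start of the name.
    """
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    kept = []
    for name in names:
        if inc is not None and not inc.search(name):
            continue  # not selected by the include filter
        if exc is not None and exc.search(name):
            continue  # dropped by the exclude filter
        kept.append(name)
    return kept


if __name__ == "__main__":
    processes = ["idle: cpu0", "idle: cpu1", "worker_thread_38", "iswts_sockio"]
    print(filter_processes(processes, exclude=r"^idle:"))
```

With the process names from the example output, excluding `^idle:` leaves only `worker_thread_38` and `iswts_sockio` in this sketch.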

Availability
------------

This check will be available for testing in the next unstable version (**3.10.1\_12**), which we will release today. At the moment it supports only cdot, not 7-mode. If you would like to get this check for 7-mode too, please send us the CLI commands used on 7-mode to get the process list. You can try them with check\_netapp\_anycli and send us the `--in` values. (For your reference, these are the commands we use to get the list on **cdot**: `set advanced -confirmations off;node run -node <node-name> -command ps`)
