We are doing T.37 fax on 2811 and 2911 routers, with calls coming in over two T1 links. Occasionally the unit hits a bug where a call gets stuck and the CPU gradually rises to nearly 100% over about 45 minutes. It then stays close to 100%, almost all of it in the DocMSP process, and the unit rejects all incoming calls until something tears down the stuck call, at which point everything returns to normal.
Cisco TAC have been unable to identify or fix the bug, so we have implemented an EEM script to detect the high CPU and bounce the two T1 links. Here is the script, triggered on the call rejection logs:
The script works fine functionally (tested by having it trigger off a user-defined log event instead of the high-CPU event). However, when the CPU is very high, the script definitely gets triggered but often doesn't run: 30 minutes or an hour later, it still hasn't bounced the T1 links.
We have the following config line attempting to give more priority to the EEM script, but it doesn't seem to be helping much:
scheduler allocate 40000 5000
I have also seen mention of a 'scheduler interval' command to allow time for low-priority processes, but that doesn't seem to be available on this platform.
Any suggestions for other ways to give more priority to the EEM script, or better values for the 'scheduler allocate' command?
event manager applet high_cpu_recovery
 event ioswdsysmon sub1 cpu-proc taskname "DocMSP" op gt val 50 is-percent true period 60
 action 1.0 syslog msg "----HIGH CPU DETECTED, BOUNCING T1s----"
 ... and so on ...
The difference from your script is that this one triggers on IOS system monitor counters rather than a syslog message. The theory is that the system monitor counters let you watch the CPU utilization of the DocMSP process and run your script before the CPU reaches 100%, so there's some CPU left to run it. I don't know if 50% ("val 50" above) is the right threshold; given your long experience with this issue, you know which values aren't sane for DocMSP CPU utilization.
My syntax above may not be 100% correct; if not, it's documented here:
This will not help if, as I propose, the maxrun time is being hit. When the CPU is high, and especially if AAA command authorization is being used, each command can take a long time to execute, pushing the policy toward its default 20-second maxrun time. I would look at maxrun first, especially if "show logg" shows the syslog message is being generated.
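As a rough sketch of the maxrun suggestion above, the applet could raise its own runtime limit on the event line. Everything specific here is an assumption, since the original script was not shown: the syslog pattern is a placeholder for whatever call rejection message you match on, the 180-second value is only an illustrative number, and the controller slot/port numbers are hypothetical.

```
! Sketch only: pattern, maxrun value and controller numbers are assumptions
event manager applet high_cpu_recovery
 ! maxrun raises the policy's runtime limit from the default 20 seconds
 event syslog pattern "YOUR_CALL_REJECTION_PATTERN" maxrun 180
 action 1.0 syslog msg "----HIGH CPU DETECTED, BOUNCING T1s----"
 action 2.0 cli command "enable"
 action 2.1 cli command "configure terminal"
 ! hypothetical controller; repeat the block for the second T1
 action 3.0 cli command "controller T1 0/0/0"
 action 3.1 cli command "shutdown"
 action 3.2 cli command "no shutdown"
 action 9.0 cli command "end"
```

If the applet then completes under load, the default maxrun was probably the limit being hit; the value can be tuned down once you know how long the commands actually take at high CPU.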