We are doing T.37 fax on 2811 and 2911 routers, with calls coming in over two T1 links. Occasionally the unit hits a bug where a call gets stuck and the CPU gradually rises to nearly 100% over about 45 minutes. It stays close to 100%, almost all of it in the DocMSP process, and the unit rejects all incoming calls until something tears down the stuck call, at which point everything returns to normal.
Cisco TAC have been unable to identify or fix the bug, so we have implemented an EEM script to detect the high CPU and bounce the two T1 links. Here is the script, triggered on the call rejection logs:
The script works fine functionally (tested by having it trigger off a user-defined log event instead of the high CPU event). When the CPU is very high, however, the script definitely gets triggered but often doesn't appear to run: 30 minutes or an hour later, it still hasn't bounced the T1 links.
We have the following config line attempting to give more priority to the EEM script, but it doesn't seem to be helping much:
scheduler allocate 40000 5000
I have also seen mention of a 'scheduler interval' command to allow time for low-priority processes, but that doesn't seem to be available on this platform.
Any suggestions for other ways to give more priority to the EEM script, or better values for the 'scheduler allocate' command?
event manager applet high_cpu_recovery
 event ioswdsysmon sub1 cpu-proc taskname "DocMSP" op gt val 50 is-percent true period 60
 action 1.0 syslog msg "----HIGH CPU DETECTED, BOUNCING T1s----"
 ... and so on ...
The difference from your script is that this one triggers on IOS system monitor counters rather than a syslog message. The theory is that watching the DocMSP process's CPU utilization through those counters lets the script run before the CPU reaches 100%, while there is still some CPU left to run it. I don't know if 50% ("val 50" above) is the right threshold; given your long experience with this issue, you know what values aren't sane for DocMSP CPU utilization.
My syntax above may not be 100% correct; if not, it's documented here:
This will not help if, as I propose, the maxrun time is being hit. When the CPU is high, and especially if AAA command authorization is being used, each command can take a long time to execute, pushing the policy toward its default 20-second maxrun time. I would look at maxrun first, especially if "show logg" shows the syslog message is being generated.
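As a rough sketch of what raising maxrun could look like on the syslog-triggered applet: the pattern string and the controller number below are placeholders, since the original script isn't shown here, and the 300-second value is only an illustrative guess. The "authorization bypass" keyword is meant to skip AAA command authorization for the applet's CLI actions; whether it and the maxrun keyword are accepted depends on the IOS/EEM release, so check with "?" on the router before relying on either.

```
! Sketch only: the syslog pattern and the controller number are placeholders.
! maxrun 300 raises the run-time limit from the 20-second default to 5 minutes.
event manager applet high_cpu_recovery authorization bypass
 event syslog pattern "PLACEHOLDER_CALL_REJECT_PATTERN" maxrun 300
 action 1.0 syslog msg "----HIGH CPU DETECTED, BOUNCING T1s----"
 action 2.0 cli command "enable"
 action 3.0 cli command "configure terminal"
 action 4.0 cli command "controller T1 0/0/0"
 action 5.0 cli command "shutdown"
 action 6.0 cli command "no shutdown"
 action 7.0 cli command "end"
```

If the applet still never completes with a longer maxrun, that would suggest the delay is in getting scheduled at all rather than in the per-command execution time.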