Re: EEM CPU Applet/TCL

leelove01 · ‎10-11-2010

I have a Cisco 6504 that is currently running s72033-advipservicesk9_wan-mz.122-33.SXI1.bin. I know this uses EEM 2.4 and cannot use ERM, so I'm in a pinch as to what to do.

This device has an EBGP peer that says they perform certain tasks within their AS to make their BGP routing as optimal as possible. The issue I have with this is that it causes us to receive a ton of BGP updates from them. It normally stays anywhere between 15-30 at any given time but can spike as high as 100+. We receive the full Internet table from them. We have another router with 2 EBGP peers that we also receive the full routing table from, and it rarely has the same problem that we are having on this one.

The problem is that Nagios does a service check on this device's interfaces. The check periodically fails and generates an alarm because it gets back a non-OK state saying the device is not responding. When I log into the device, everything looks fine. I might catch a CPU spike from what normally appears to be the BGP Router process. Congestion and CPU spikes are the 2 main reasons I can think of for a device being slow to respond or not responding at all.

I know we can make a small change in Nagios to have it recheck the service before alarming, but I would really like to get more detailed information on the CPU spikes and the process behind them, just so I can confirm whether it is indeed the CPU spike causing the Nagios service check to fail. I saw some information on ERM and EEM but then realized our IOS doesn't support ERM, and it looks like I might be stuck using Tcl if an applet can't do the job. (hooray!!!)

Can anyone assist me with this so that I can get more detailed information about when the spikes occur and what process is causing them? I would like to catch anything over 70% total CPU over 5 secs, so that over time I can gather enough information to correlate the spikes with the Nagios alarms.

If by some chance it's not BGP causing this problem, I want to gather more information that might help point me in the right direction. I would also like the gathered information emailed to me, if possible, instead of put into a directory on the router/switch. I appreciate any assistance.

Lee

Joe Clarke · ‎10-11-2010

Your applet here should work, but there is no task name, "BGP." You have a few BGP tasks such as "BGP I/O," "BGP Scheduler," etc. What might work better is to use the SNMP ED to watch overall CPU. When that crosses a threshold, then you run "show proc cpu sorted" and email the results. For example:

event manager applet CPU_Utilization
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.11.1 get-type exact entry-op gt entry-val 70 exit-op lt exit-val 70 poll-interval 30
action 01.0 cli command "enable"
action 02.0 cli command "show processes cpu sorted"
action 03.0 syslog msg "The CPU applet has run (CPU is at $_snmp_oid_val %)"
action 04.0 mail server "x.x.x.x" to "xxx@xxx.com" from "xxxx@xxx.com" subject "BR3 BGP" body "$_cli_result"

leelove01 · ‎10-11-2010

Joe,

Thanks for the response. I just have a couple of questions regarding your post.

event manager applet CPU_Utilization
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.11.1 get-type exact entry-op gt entry-val 70 exit-op lt exit-val 70 poll-interval 30
action 01.0 cli command "enable"
action 02.0 cli command "show processes cpu sorted"
action 03.0 syslog msg "The CPU applet has run (CPU is at $_snmp_oid_val %)"
action 04.0 mail server "x.x.x.x" to "xxx@xxx.com" from "xxxx@xxx.com" subject "BR3 BGP" body "$_cli_result"

I can set this to poll the overall CPU on the 5-second timer, correct? Do I just need to substitute the correct OID?

Now, I understand entry-op gt entry-val (I believe it is the threshold that triggers the event?), but I'm hoping you can clarify the last two. Does exit-val 70 mean that it will keep performing the action statements (every 30 seconds, as determined by the poll-interval) as long as the CPU is above 70?

Does poll-interval 30 mean the event will only trigger and run once every 30 seconds?

Based on what was stated in the initial post, what would you suggest as the best method to get further information on what might be causing the Nagios service check to come back in a non-OK state?

Joe Clarke · ‎10-11-2010

This applet already polls the 5 second CPU time. No modification to the SNMP OID is necessary. The exit criterion states that the applet will not run again until the CPU utilization drops below 70% then climbs above 70% again. Without an exit condition, the applet would fire every 30 seconds as long as the CPU is above 70%. This could spam your inbox. The poll-interval states that the applet will check the CPU utilization every 30 seconds.
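As a rough annotated sketch of the event line discussed above, the keywords map onto that explanation roughly as follows. The `!` lines are IOS comments added here for illustration only; the OID and values are the ones already used in this thread.

```
event manager applet CPU_Utilization
 ! get-type exact            : read exactly this OID (as opposed to get-type next)
 ! entry-op gt, entry-val 70 : trigger when the polled value rises above 70
 ! exit-op lt, exit-val 70   : once triggered, do not trigger again until the
 !                             polled value has dropped below 70
 ! poll-interval 30          : poll the OID every 30 seconds
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.11.1 get-type exact entry-op gt entry-val 70 exit-op lt exit-val 70 poll-interval 30
```

If the CPU tends to hover right around the threshold, a lower exit-val (say 60, an arbitrary figure chosen for illustration) should add some margin so the applet is not re-armed and re-triggered in quick succession.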

If you suspect a CPU spike and would like to figure out the next troubleshooting steps, this applet would be a good first step. Once you identify the spiking process, you can write an applet that dumps "show stack" for that process. That output can be used to determine what the process is doing at the time of the spike.
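A hedged sketch of what that follow-up applet might look like is below. The applet name `CPU_Stack_Dump`, the mail subject, and the PID `123` are placeholders invented for illustration; the PID would be replaced with whatever process the first applet's output points to, and the mail fields are the same elided placeholders used earlier in the thread.

```
event manager applet CPU_Stack_Dump
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.11.1 get-type exact entry-op gt entry-val 70 exit-op lt exit-val 70 poll-interval 30
 action 01.0 cli command "enable"
 ! 123 is a placeholder PID, not a value taken from this thread
 action 02.0 cli command "show stacks 123"
 action 03.0 mail server "x.x.x.x" to "xxx@xxx.com" from "xxxx@xxx.com" subject "CPU stack dump" body "$_cli_result"
```

Because process IDs can change after a reload, it is worth re-checking the PID with "show processes cpu sorted" before relying on this.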

leelove01 · ‎10-11-2010

Joe,

Thanks. I have submitted this for approval before anything is implemented. Once we do implement it, hopefully it will work. If I run into any issues, I will definitely put up another post. Thanks again for the assistance. I'm going to have to take the time to brush up on Tcl scripting now.

Lee