EEM Script That Monitors CPU Levels Triggers Incorrectly

05-24-2017

Hello EEM scripting community,

I recently built a couple of EEM scripts to monitor CPU utilization on our core 4507 switch. We need this real-time monitoring because VoIP call quality degrades when CPU utilization gets too high, specifically when CPU #1 goes above 65%. The scripts are pasted below, along with snmpwalk results showing the OID values that the scripts use. Working with Cisco TAC, we identified two OIDs that represent CPU #1 of the SUP modules in slot 3 and slot 4:

For slot 3: 1.3.6.1.4.1.9.9.109.1.1.2.1.4.3000.1

For slot 4: 1.3.6.1.4.1.9.9.109.1.1.2.1.4.4000.1

Note that the 4507 has two supervisor modules in an active/standby configuration. The specific SUP module model is WS-X45-SUP8-E, and each module has four CPUs, numbered 0 through 3.

Here is the problem. When the script triggers, I look at the email notice (and at the switch itself) and don't see any CPU spikes. In fact, the CPU has been running at 20% or lower for the past 60 minutes. Since Cisco TAC says I am using the correct OIDs, I am left with a script problem. Please take a look at the scripts and let me know whether I have screwed up the event snmp oid lines and, if so, how I should adjust them so that the script doesn't trigger when the CPU isn't having a utilization problem.

As a bonus question, how would you recommend I adjust the event snmp oid lines so that we don't get flooded with email alerts when a high CPU condition DOES exist, i.e., only one email alert per 10 minutes or so?

Thanks in advance for your assistance.

The snmpwalk output

C:\usr\bin>snmpwalk -v 2c -c snmpstring 10.10.20.20 1.3.6.1.4.1.9.9.109.1.1.2.1.4
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.3000.0 = Gauge32: 2
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.3000.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.3000.2 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.3000.3 = Gauge32: 2
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.4000.0 = Gauge32: 17
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.4000.1 = Gauge32: 5
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.4000.2 = Gauge32: 3
SNMPv2-SMI::enterprises.9.9.109.1.1.2.1.4.4000.3 = Gauge32: 7

C:\usr\bin>

The EEM Scripts

event manager applet HighCPU-Sup-Module-3 authorization bypass
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.2.1.4.3000.1 get-type exact entry-op ge entry-val "65" poll-interval 0.500
action 1.0 cli command "enable"
action 2.0 cli command "show proc cpu history"
action 3.0 mail server "10.10.10.10" to "system_notices@ourcompany.com" from "switchname@ourcompany.com" subject "High CPU Alert on switchname" body "$_cli_result"

event manager applet HighCPU-Sup-Module-4 authorization bypass
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.2.1.4.4000.1 get-type exact entry-op ge entry-val "65" poll-interval 0.500
action 1.0 cli command "enable"
action 2.0 cli command "show proc cpu history"
action 3.0 mail server "10.10.10.10" to "system_notices@ourcompany.com" from "switchname@ourcompany.com" subject "High CPU Alert on switchname" body "$_cli_result"

Joe Clarke · ‎05-25-2017

A few concerns.

1. You are only looking at one CPU when the sup has four. The switch will likely balance load out across the four CPUs. So even if one is at 65%, the switch itself could be running at < 20% overall CPU utilization.

2. You are polling way too aggressively. Half a second could be causing SNMP ENGINE to spike the one CPU you're monitoring. This might also be causing an issue where the switch is not updating the SNMP values properly. If you increase the polling interval (to maybe 10 seconds), do you still see this problem?

3. You should print the value of $_snmp_oid_val in your email. My guess is that EEM is doing the right thing and triggering when it sees CPU #1 at or above 65%.

05-25-2017

Thanks for your feedback Joe.

Yes, I am focusing on only one of the four CPUs. In past TAC cases, Cisco switching engineers have told me that CPU #1 is the CPU that handles switching functions. I have asked those same engineers whether the switching load is configurable, i.e., whether we could issue a command to change which CPUs the switching function uses, and the answer was no. You would think that the SUP module would be engineered to load balance functions across all CPUs, but apparently not.

After my last false-positive email alert, I edited the script to poll only once every 10 minutes; the poll interval is now set to "600". This should also keep the script from spamming us too frequently when the CPU really is overutilized.
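On limiting alerts without stretching the poll interval: a 600-second poll reads the OID only once every ten minutes, so a spike that starts and ends between two polls would go unnoticed. The event snmp detector also accepts exit criteria that delay re-arming after a trigger. The line below is a minimal, untested sketch, assuming the EEM version on this switch accepts the exit-time keyword on its own. The 10-second poll interval is the value suggested earlier in this thread, and the 600-second hold-off is an illustrative choice, not a tested recommendation.

```
! Sketch only: verify keyword support with "event snmp ?" on the switch.
! Polls every 10 seconds. After the entry condition fires, event
! monitoring is not re-enabled until 600 seconds have passed.
event manager applet HighCPU-Sup-Module-3 authorization bypass
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.2.1.4.3000.1 get-type exact entry-op ge entry-val "65" exit-time 600 poll-interval 10
```

The action lines stay as they are in the original applet. The same change would apply to the slot 4 applet with the 4000.1 OID.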

I like your idea of printing the SNMP OID value in the body of the email. Can you give me an example of how to do that?

Thanks again for your assistance.

Joe Clarke · ‎05-26-2017

Just add the variable to the body:

body "Value : '$_snmp_oid_val', output : '$_cli_result'"

Are you still seeing the problem when you move the polling interval out?
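Putting the pieces from this thread together, the slot 3 applet with the longer poll interval and the OID value in the mail body might look like the sketch below. It only combines values quoted in the posts (the 600-second interval and the body string) and has not been verified on a switch. The slot 4 applet would differ only in its name and the 4000.1 OID.

```
event manager applet HighCPU-Sup-Module-3 authorization bypass
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.2.1.4.3000.1 get-type exact entry-op ge entry-val "65" poll-interval 600
 action 1.0 cli command "enable"
 action 2.0 cli command "show proc cpu history"
 action 3.0 mail server "10.10.10.10" to "system_notices@ourcompany.com" from "switchname@ourcompany.com" subject "High CPU Alert on switchname" body "Value : '$_snmp_oid_val', output : '$_cli_result'"
```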

05-30-2017

Thanks so much for letting me know how to adjust the body variable.

At this point I think we are good to go. The script hasn't triggered since I moved the polling interval to 10 minutes. With the latest modification, I will be able to see the CPU value the next time the script triggers.

Thanks again!