We are using a couple of 6509s in our distribution layer. For about 48 hours now, one of these systems has been generating HighUtilization alerts in our CiscoWorks LMS 3.0 (specifically DFM 3.0.2). The alerts indicate that one of the CPUs is pretty busy (97% all the time), but DFM does not clearly state which CPU that is. So I've been doing some troubleshooting with NET-SNMP. These are the commands I've been running:
$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw enterprises.9.9.109.1.1.1.1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.1 = INTEGER: 3017
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.2 = INTEGER: 3001
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.3 = INTEGER: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.3.1 = Gauge32: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.3.2 = Gauge32: 10
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.3.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.4.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.4.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.4.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.6.1 = Gauge32: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.6.2 = Gauge32: 10
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.6.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.7.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.7.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.7.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.3 = Gauge32: 97
As you can see, one of the CPUs is at 97%. Now I want to find out which one. The first two are no problem:
$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw entPhysicalDescr.3017
ENTITY-MIB::entPhysicalDescr.3017 = STRING: CPU of Routing Processor
$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw entPhysicalDescr.3001
ENTITY-MIB::entPhysicalDescr.3001 = STRING: CPU of Switching Processor
So far, so good. What about the third one?
$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw entPhysicalDescr.0
entPhysicalDescr.0: Unknown Object Identifier (Index out of range: 0 (entPhysicalIndex))
So I've got one CPU (probably the one on the PFC3 of our Sup720 board) showing 97% usage, but its entPhysicalIndex is 0? I don't see this on our other 6509s. Furthermore, how can I investigate what this CPU is so busy with? A 'sh proc cpu' doesn't help, because that shows me the info from the MSFC CPU. BTW: I don't have any noticeable network problems.
Any tips are much appreciated! Cheers,
At the risk of going slightly off topic, I'd just like to add a little something here. The supervisor has two CPUs on board: one on the route processor (RP or MSFC) and one on the switch processor (SP or PFC). When you issue a 'show proc cpu' in IOS on the 6k, you are pulling the CPU stats from the RP, which handles most of the things we would ordinarily think about: routing protocol updates, process-switched packets, telnet/SSH sessions, and the higher-level functions of the device.
To pull the CPU stats from the SP you'd issue a 'remote command switch show process cpu'. The SP CPU is responsible for the low-level goings-on in the switch: it handles IGMP snooping, spanning-tree BPDU processing, and so on. It's also responsible for keeping in touch with the line cards via the EOBC (Ethernet out-of-band channel).
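For example, to compare the two sides from the CLI (a rough sketch only; the '6509#' prompt is just illustrative, and the 'sorted' keyword and pipe filtering depend on your IOS release):

6509# show processes cpu sorted | exclude 0.00
6509# remote command switch show processes cpu sorted

The first command lists the busiest processes on the RP, the second the busiest on the SP; the '| exclude 0.00' filter simply hides the idle processes.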
Getting back on track: in your case it looks like you identified the overutilized CPU as the one on the WiSM module. fw_lcp (firmware line card protocol) is the process responsible for communicating with the supervisor over the EOBC, so it looks like the WiSM was having trouble talking to the sup via LCP. Why this happened in the first place is anybody's guess if you can't reproduce it, but hopefully now you'll have a better grasp of what originally happened!
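For the record, that per-process view is also reachable over SNMP from the same MIB you were already walking. A rough sketch, going from memory on the object name and assuming you have the CISCO-PROCESS-MIB loaded locally so net-snmp can resolve it:

$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw CISCO-PROCESS-MIB::cpmProcessName

The process table is indexed by the same CPU index you saw in your first walk plus the process ID, so the entries belonging to the busy CPU should show up as cpmProcessName.3.<pid>, with the matching utilization figures in the neighbouring cpmProcExt objects.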