Solved: Re: High CPU on Catalyst 6509

v.c.bodenstab · 05-30-2008

Hi all,

We are using a couple of 6509s at our distribution layer. For about the last 48 hours, one of these systems has been generating HighUtilization alerts in our CiscoWorks LMS 3.0 (specifically DFM 3.0.2). The alerts indicate that one of the CPUs is pretty busy (97% all the time). DFM does not clearly state which CPU this is, so I've been doing some troubleshooting with NET-SNMP. These are the commands I've been running:

$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw enterprises.9.9.109.1.1.1.1

SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.1 = INTEGER: 3017
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.2 = INTEGER: 3001
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.3 = INTEGER: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.3.1 = Gauge32: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.3.2 = Gauge32: 10
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.3.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.4.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.4.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.4.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.6.1 = Gauge32: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.6.2 = Gauge32: 10
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.6.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.7.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.7.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.7.3 = Gauge32: 97
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.3 = Gauge32: 97

As you can see, one of the CPUs is at 97%. Now I want to find out which one. The first two are no problem:

$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw entPhysicalDescr.3017

ENTITY-MIB::entPhysicalDescr.3017 = STRING: CPU of Routing Processor

$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw entPhysicalDescr.3001

ENTITY-MIB::entPhysicalDescr.3001 = STRING: CPU of Switching Processor

So far, so good. What about the third one?

$ snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw entPhysicalDescr.0

entPhysicalDescr.0: Unknown Object Identifier (Index out of range: 0 (entPhysicalIndex))

So, I've got one CPU (probably the one on the PFC3 on our SUP720 board) that is showing 97% usage, but its entPhysicalIndex is 0? I don't see this on our other 6509s. Furthermore, how can I investigate what this CPU is so busy with? A 'sh proc cpu' doesn't help, because it shows me the info from the MSFC CPU. BTW: I don't have any noticeable network problems.

Any tips are much appreciated! Cheers,

Vincent

Collin Clark · 05-30-2008

Post the results of the following command.

sh proc cpu | e 0.00% 0.00% 0.00%

This will show only the processes whose CPU utilization is not 0 across all three intervals.
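For anyone post-processing saved output off the box, the same exclusion can be sketched in Python. This is only an illustration, assuming the output has been captured as plain text; the function name `busy_processes` and the idle "Chunk Manager" row with its values are made up for the example, and columns are matched after whitespace splitting rather than by the exact spacing the IOS filter uses.

```python
def busy_processes(show_proc_cpu_output):
    """Drop every line that contains three consecutive 0.00% columns.

    Approximates `show proc cpu | exclude 0.00% 0.00% 0.00%`, but compares
    whitespace-separated fields instead of the literal string.
    """
    kept = []
    for line in show_proc_cpu_output.splitlines():
        fields = line.split()
        idle = any(
            fields[i:i + 3] == ["0.00%", "0.00%", "0.00%"]
            for i in range(len(fields) - 2)
        )
        if line.strip() and not idle:
            kept.append(line)
    return kept


# Two header lines, one made-up idle row, one busy row taken from this thread.
sample = """\
CPU utilization for five seconds: 1%/0%; one minute: 1%; five minutes: 1%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
1 0 3 0 0.00% 0.00% 0.00% 0 Chunk Manager
8 30809012 132392063 232 0.07% 0.03% 0.02% 0 ARP Input
"""

for line in busy_processes(sample):
    print(line)
```

The header lines survive because they never contain the idle pattern, which matches what the IOS filter does.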

v.c.bodenstab · 05-31-2008

Hi,

Thanks for the reply. This is the output of the command. It looks to me like it's not particularly busy...

mu-6509-1#sh proc cpu | excl 0.00% 0.00% 0.00%

CPU utilization for five seconds: 1%/0%; one minute: 1%; five minutes: 1%

PID Runtime(ms)    Invoked uSecs   5Sec   1Min   5Min TTY Process
5     137233748   11129235 12330  0.00%  0.26%  0.20%   0 Check heaps
8      30809012  132392063   232  0.07%  0.03%  0.02%   0 ARP Input
37     22511144    1138680 19769  0.00%  0.03%  0.00%   0 Per-minute Jobs
59       190024   65971217     2  0.07%  0.00%  0.00%   0 Heartbeat Proces
122    86232064  404382109   213  0.15%  0.07%  0.06%   0 IP Input
173    37231752   20394709  1825  0.07%  0.03%  0.00%   0 IPC LC Message H
177    19292156   98107973   196  0.07%  0.02%  0.01%   0 CEF process
178    62432544  201714904   309  0.00%  0.02%  0.02%   0 SNMP ENGINE
275   291438188 2438835553   119  0.00%  0.33%  0.35%   0 Port manager per
316          72        290   248  0.55%  0.09%  0.02%   1 Virtual Exec

Do you see anything strange?

Best,

Vincent

Collin Clark · 06-02-2008

Vincent-

Taking a look at this, the CPU is not running at 97%. What OID are you using to get CPU utilization?

v.c.bodenstab · 06-02-2008

Hi,

I'm using the following command from a Linux box to get the values:

snmpwalk -v3 -a SHA -A xxx -u xxx -l authNoPriv -E 800000090300000BBE574F01 cat6509sw enterprises.9.9.109.1.1.1.1

The OIDs I'm getting from the switch are:

enterprises.9.9.109.1.1.1.1.1

enterprises.9.9.109.1.1.1.1.2

enterprises.9.9.109.1.1.1.1.3

for the three CPUs.

Collin Clark · 06-02-2008

I show enterprises.9.9.109.1.1.1.1.1 as unsupported, enterprises.9.9.109.1.1.1.1.2 holds the physical index values, and enterprises.9.9.109.1.1.1.1.3 is the 5-second interval (polling it can cause increased load). Try the following and see what you get:

1 Minute

enterprises.9.9.109.1.1.1.1.4

5 Minute

enterprises.9.9.109.1.1.1.1.5
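To make the column layout above easier to follow, here is a rough Python sketch that regroups numeric snmpwalk output like Vincent's into one record per CPU. The column names follow my reading of CISCO-PROCESS-MIB (column 2 is cpmCPUTotalPhysicalIndex, 3 to 5 are the older 5sec/1min/5min values, 6 to 8 their "Rev" replacements); the function name `parse_cpu_table` and the dictionary keys are invented for this example. As far as I understand the MIB, a physical index of 0 means the agent has no entPhysicalTable entry for that CPU, which would explain why the entPhysicalDescr.0 lookup fails.

```python
import re
from collections import defaultdict

# Columns of cpmCPUTotalEntry (enterprises.9.9.109.1.1.1.1), keyed by column number.
COLUMNS = {
    2: "physical_index",  # cpmCPUTotalPhysicalIndex
    3: "5sec",            # cpmCPUTotal5sec
    4: "1min",            # cpmCPUTotal1min
    5: "5min",            # cpmCPUTotal5min
    6: "5sec_rev",        # cpmCPUTotal5secRev
    7: "1min_rev",        # cpmCPUTotal1minRev
    8: "5min_rev",        # cpmCPUTotal5minRev
}

LINE_RE = re.compile(
    r"enterprises\.9\.9\.109\.1\.1\.1\.1\.(\d+)\.(\d+) = \w+: (\d+)"
)


def parse_cpu_table(walk_output):
    """Return {row_index: {column_name: value}} from numeric snmpwalk text."""
    rows = defaultdict(dict)
    for match in LINE_RE.finditer(walk_output):
        column, row, value = (int(group) for group in match.groups())
        name = COLUMNS.get(column)
        if name is not None:
            rows[row][name] = value
    return dict(rows)


# A subset of the walk output posted earlier in this thread.
walk = """\
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.1 = INTEGER: 3017
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.2 = INTEGER: 3001
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.2.3 = INTEGER: 0
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.1 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.2 = Gauge32: 7
SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.5.3 = Gauge32: 97
"""

for row, cpu in sorted(parse_cpu_table(walk).items()):
    # Index 0 cannot be looked up in entPhysicalTable, so label it instead.
    entity = cpu["physical_index"] or "none (no entPhysicalTable entry)"
    print(f"row {row}: entPhysicalIndex={entity}, 5min={cpu['5min']}%")
```

Read this way, the last number in each OID is the CPU (table row) and the number before it is the statistic, so the 97% belongs to row 3, the CPU without a physical index.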

v.c.bodenstab · 06-02-2008

Hi all,

I did some extra research on the 6500 chassis last night, and I finally found out which CPU was causing the problems.

remote command module 8 show proc cpu

showed me that our WiSM (specifically the CFC) was busy at 97%. The fw_lcp was quite busy :-)

After a power cycle of the module, the CPU usage returned to normal.

Thanks for your help!

cheers,

Vincent

Ryan Carretta · 06-02-2008

Hi Vincent,

At the risk of going slightly off topic, I'd just like to add a little something here. The supervisor has two CPUs on board - one on the route processor (RP or MSFC) and one on the switch processor (SP or PFC). When you issue a 'show proc cpu' in IOS on the 6k you are pulling the CPU stats from the RP, which handles most things we would ordinarily think about: routing protocol updates, process-switched packets, telnet/SSH sessions, and higher-level functions of the device.

To pull the CPU stats from the SP you'd issue a 'remote command switch show process cpu'. The SP CPU is responsible for the low-level goings-on in the switch. It handles IGMP snooping, spanning-tree BPDU processing, etc. It's also responsible for keeping in touch with the line cards via the EOBC (Ethernet out-of-band channel).

Getting back on track, in your case it looks like you identified the overutilized CPU to be on the WiSM module. fw_lcp (firmware line card protocol) is the process responsible for communicating via the EOBC with the supervisor. It looks like the WiSM was having trouble communicating with the sup via LCP. Why this happened in the first place is anybody's guess if you can't reproduce it, but hopefully now you'll have a better grasp of what originally happened!
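As a quick recap of which command reaches which CPU, here is a small Python sketch built only from the commands that appear in this thread. The function name `cpu_show_command` and the target labels "rp", "sp" and "module" are made up for illustration; nothing here talks to a device, it just builds the command string.

```python
def cpu_show_command(target, module=None):
    """Return the IOS command that shows CPU usage for the chosen processor.

    target: "rp" (route processor), "sp" (switch processor), or
            "module" (a line card or service module, e.g. the WiSM in slot 8).
    """
    if target == "rp":
        return "show process cpu"
    if target == "sp":
        return "remote command switch show process cpu"
    if target == "module":
        if module is None:
            raise ValueError("a module (slot) number is required")
        return f"remote command module {module} show process cpu"
    raise ValueError(f"unknown target: {target!r}")


print(cpu_show_command("rp"))
print(cpu_show_command("sp"))
print(cpu_show_command("module", 8))
```

The last call reproduces the command Vincent used to find the busy WiSM CPU, with 'proc' spelled out.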

v.c.bodenstab · 06-03-2008

Hi,

Thanks for the clarification and the explanation of the architecture. I'll keep a close eye on this system for the next couple of months or so.

Cheers,

Vincent

Collin Clark · 06-03-2008

Great info Ryan, thanks.

Wilson Samuel · 05-30-2008

Hi,

Have you enabled SNMPv2 on the switch? I once encountered a Sup 720 whose CPU utilization was pegged after SNMPv2 was enabled.

If it's feasible for you, please turn off SNMP on the switch and see if that helps.

HTH,

Please rate all helpful posts.

Kind Regards,

Wilson Samuel

v.c.bodenstab · 05-31-2008

Hi Wilson,

I'm only using SNMPv3 to monitor this device. Furthermore, I've looked at our TACACS+ logging and haven't found any records indicating that something has changed in the config. The strangest thing is that the device just started running at 97% by itself.

Vince