Sup720 CPU Utilization at 100%

May 3rd, 2010

Hi all,

This weekend our data center switch had a meltdown.  I consoled into the switch and saw that CPU utilization was at 100%.  I ran "show proc cpu sort" and ARP Input was at 14%.  I could not isolate which other process was hogging the CPU, so I rebooted the switch, which resolved the issue.  I also searched the bug tool for this particular IOS release and nothing really stood out.  We have been running this IOS for almost a year and a half.

I am wondering if there is an IOS command that can trace back which process hogged the CPU and help find the root cause of the problem.

Thanks.

Sup-720 base

IOS: Advanced IP Services 12.2(18)SXF15

Giuseppe Larosa Mon, 05/03/2010 - 11:14

Hello Kevin,

>> I am wondering if there is an IOS command that can backtrace what  process hogged the CPU and found the root cause

Not after a reload, but I understand that you needed a quick workaround.

If possible, in similar cases you should capture the output of sh proc cpu and sh log and save it to a text file before reloading the device (the logs can also be retrieved from a syslog server if the device exports log messages to an external server).

Also, SW interrupts could be the main consumers of CPU resources. The second number in the five-second CPU usage figure shows how much CPU is used by interrupts, so it is possible to have the CPU at 100% while sh proc cpu sorted shows no process above, say, 20 percent.

Very high CPU usage can be triggered by a bridging loop that leads to a broadcast storm, which forces the CPU to process a rapidly increasing number of broadcast frames (there is no TTL at OSI layer 2, so frames are not dropped after circulating many times in the loop, and each cycle has a multiplication effect).

A bridging loop can also have side effects, for example on HSRP groups: if the broadcast domains of two VLANs are joined, the device can receive on VLAN X the multicast HSRP hello packets of VLAN Y, and this can cause problems even on C6500 devices. A fix for this is to use a different MD5 password for each HSRP group in each VLAN, so that if two VLANs are accidentally joined, HSRP group N ignores frames from group M and vice versa.
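To make the point about the second number concrete, here is a minimal Python sketch that splits the first line of saved sh proc cpu output into total, interrupt, and process-level percentages. The function name parse_cpu_header and the sample line are my own illustrations (the sample values are made up, not taken from the switch in this thread), and the sketch assumes the usual "five seconds: total%/interrupt%" layout of that line.

```python
import re
from typing import NamedTuple


class CpuSummary(NamedTuple):
    total_pct: int      # first number: total CPU over the last five seconds
    interrupt_pct: int  # second number: share spent at interrupt level
    process_pct: int    # remainder: share spent in scheduled processes


# Matches the five-second part of the header, e.g. "... five seconds: 100%/82%; ..."
_HEADER = re.compile(r"CPU utilization for five seconds:\s*(\d+)%/(\d+)%")


def parse_cpu_header(line: str) -> CpuSummary:
    """Split the five-second figure into total, interrupt and process shares."""
    match = _HEADER.search(line)
    if match is None:
        raise ValueError("not a 'CPU utilization for five seconds' line")
    total, interrupt = int(match.group(1)), int(match.group(2))
    return CpuSummary(total, interrupt, total - interrupt)


if __name__ == "__main__":
    # Hypothetical sample line, for illustration only.
    sample = ("CPU utilization for five seconds: 100%/82%; "
              "one minute: 97%; five minutes: 95%")
    summary = parse_cpu_header(sample)
    print(f"total={summary.total_pct}% interrupt={summary.interrupt_pct}% "
          f"process={summary.process_pct}%")
```

With a line like the sample, most of the load is at interrupt level, which is why the per-process list can look almost idle while the CPU is pegged.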

Hope this helps

Giuseppe

lamav Mon, 05/03/2010 - 11:17

Configure the service nagle command, which might help you access the box when the CPU is at 100%. Then you will be able to execute the commands Giuseppe suggests before you have to reboot it.

Victor
