4507R w/SupV crash troubleshooting

branfarm1 · ‎12-18-2009

Hi there,

I have a 4507R with a Sup-V that crashed yesterday morning. My bandwidth monitoring showed that a downstream switch connected to an HP BladeEnclosure was transmitting about 900Mbps at the time of the crash. My 'show ver' output says "System returned to ROM by abort at PC 0x0". Does anyone know what that means, and where I should look for more information? During the time the downstream switch was transmitting 900Mbps, the CPU utilization on the 4507R was 99-100% -- I couldn't maintain connectivity with the switch long enough to ever find out much of anything. I had to go unplug the fiber cable from the HP switch (Cisco CBS3020) before things calmed down.

I was able to see from 'show proc cpu' that the Cat4K Mgmt LoPri process was the one consuming the CPU, but I couldn't look into it further due to the connectivity problems. I was able to grab some logs from my syslog server that showed a number of MACFLAP_NOTIF messages from 4 other blade switches connected to this 4507R, but I didn't receive anything from the 4507R itself until it rebooted.

Any ideas on what I can look for to help troubleshoot the cause of the issue?

Thanks in advance,

Brandon

Giuseppe Larosa · ‎12-18-2009

Hello Brandon,

how are you? I hope you are well.

There is a specific document for high CPU utilization on C4500 switches:

http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a00804cef15.shtml

it says:

>> This show processes cpu output shows that there are two processes that use the CPU— Cat4k Mgmt HiPri and Cat4k Mgmt LoPri . These two processes aggregate multiple platform-specific processes which perform the essential management functions on the Catalyst 4500. These processes process control plane as well as data packets that need to be software-switched or processed.

Also, the note about the number of MACFLAP_NOTIF messages makes me think of:

a) a possible bridging loop

b) or some servers teamed active/active but connected to different switches

So also double check all the server teaming setups with the server administrators.
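To help tell these two cases apart, the MACFLAP_NOTIF lines collected on the syslog server can be tallied per host and per port pair. The following is only a minimal Python sketch: it assumes the usual message layout "Host <mac> in vlan <id> is flapping between port <x> and port <y>", and the function names, hostnames, MAC addresses and ports in it are hypothetical. As a rough guide, many different MACs flapping between the same two ports tends to point to a loop, while a few server MACs flapping is more consistent with teaming.

```python
import re
from collections import Counter

# Assumed message layout; adjust the pattern if your syslog lines differ.
MACFLAP_RE = re.compile(
    r"MACFLAP_NOTIF: Host (?P<mac>\S+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)


def summarize_flaps(lines):
    """Count flap notifications per (mac, vlan, port pair)."""
    counts = Counter()
    for line in lines:
        m = MACFLAP_RE.search(line)
        if m is None:
            continue  # not a MACFLAP_NOTIF line
        # Sort the two ports so "A and B" and "B and A" count as one pair.
        ports = tuple(sorted((m["a"], m["b"])))
        counts[(m["mac"], int(m["vlan"]), ports)] += 1
    return counts


def distinct_macs_per_port_pair(counts):
    """Return how many different MACs flap between each port pair."""
    macs_by_pair = {}
    for mac, _vlan, ports in counts:
        macs_by_pair.setdefault(ports, set()).add(mac)
    return {ports: len(macs) for ports, macs in macs_by_pair.items()}


# Hypothetical sample lines, for illustration only.
sample = [
    "Dec 17 08:01:02 cbs3020-a %SW_MATM-4-MACFLAP_NOTIF: Host 0011.2233.4455 "
    "in vlan 10 is flapping between port Gi0/1 and port Gi0/2",
    "Dec 17 08:01:03 cbs3020-a %SW_MATM-4-MACFLAP_NOTIF: Host 0011.2233.4455 "
    "in vlan 10 is flapping between port Gi0/2 and port Gi0/1",
    "Dec 17 08:01:04 cbs3020-a %SW_MATM-4-MACFLAP_NOTIF: Host 0011.2233.4466 "
    "in vlan 10 is flapping between port Gi0/1 and port Gi0/2",
    "Dec 17 08:01:05 cbs3020-a %LINK-3-UPDOWN: Interface Gi0/5, changed state to up",
]

if __name__ == "__main__":
    flap_counts = summarize_flaps(sample)
    for key, n in flap_counts.most_common():
        print(key, n)
    print(distinct_macs_per_port_pair(flap_counts))
```

In practice the lines would be read from the syslog file for the time window around the crash rather than from a hard-coded list.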

Edit:

"System returned to ROM by abort at PC 0x0"

Program counter 0x0 should not be a legal address for an instruction.

I tried to search. Posting the sh ver and sh stack output in the Output Interpreter may help to explain the crash.

Hope to help

Giuseppe

branfarm1 · ‎12-18-2009

Hi Giuseppe,

I'm doing well -- thank you for asking. I hope everything is well for you. Happy holidays!

We use the box topology between our core and access switches, which are all HP CBS3020's. Each server is dual homed to two CBS3020's. I've seen the MACFLAP notification fairly often, but I've never thought much of it because of the teaming.

A bridging loop is possible -- that's a good avenue to investigate. As you probably remember, I have two 4507R's and multiple HP Blade Enclosures with 2 CBS3020's installed in each enclosure. As I mentioned before, I use the box topology between core and access, so the primary 4507R is connected to the primary CBS3020, and the second 4507R is connected to the second CBS3020. My monitoring tool lost contact with the primary CBS3020 in the enclosure that seemed to be causing the issue, and I actually had to manually power cycle the CBS3020 before access was restored. Is it possible the CBS3020 crashed in such a way that a bridging loop was formed? Is there any way for me to verify that?

To answer the other responder: I have 12.2(52)SG on my 4507R's, and 12.2(40)SE2 on the CBS3020's.

--Brandon

Reza Sharifi · ‎12-18-2009

Hi Brandon,

What version of IOS are you running? The bug and workaround below are for SXI (6500), but you may have the same bug in your 4000.

Devices running Cisco IOS may reload with the error message "System returned to ROM by abort at PC 0x0" when processing SSHv2 sessions. A switch crashes. We have a script running that will continuously ssh-v2 into the 3560 then close the session normally. If the vty line that is being used by SSHv2 sessions to the device is cleared while the SSH session is being processed, the next time an ssh into the device is done, the device will crash.

Conditions: This problem is platform independent, but it has been seen on Cisco Catalyst 3560, Cisco Catalyst 3750 and Cisco Catalyst 4948 series switches. The issue is specific to SSH version 2, and it is seen only when the box is under brute force attack. This crash is not seen under normal conditions.

Workaround: There are mitigations to this vulnerability: For Cisco IOS, the SSH server can be disabled by applying the command crypto key zeroize rsa while in configuration mode. The SSH server is enabled automatically upon generating an RSA key pair. Zeroing the RSA keys is the only way to completely disable the SSH server.

HTH

Reza