Possible causes of a "race" condition on a 6509E Core Switch?

markmagruder
Level 1

Version: 12.2(17d)SXB8

HW/FirmWare: 4.1/8.1(3)

I am investigating an incident described as a sudden occurrence of a "race" condition on a 6509E core switch. I'm an engineer but not a network engineer. I'm looking for answers and the right questions to ask the net-ops team.

I'm sure these questions will raise more questions, but here is a start.

1. Are 6509E devices prone to "sudden" race conditions?

2. Where are the device's error logs and command history logs stored?

3. What are the possible causes of a "race" condition in a 6509E Switch that can only be stopped by powering down?

Any suggestions on what questions to ask or information to request?

thanks,

Mark

10 Replies

michaelchoo
Level 1

What race condition are you referring to? Hardware-related or routing-related?

Good question. I don't know. The symptoms were: one, no traffic (or very, very little) was going through the switch; and two, the management console was unresponsive. The fix was to power down the switch and let traffic fail over to the redundant "B" switch. The "B" switch had no clue the primary switch/route path was not forwarding traffic.

This wasn't a race condition, as such. A race condition happens when the outcome depends on the unpredictable timing or ordering of two threads or processes; the situation where two threads can each only proceed once they acquire a resource the other has already secured is, strictly speaking, a deadlock.

You DID experience a high-CPU condition, whose cause is as yet unknown. More often than not this is caused by interrupt traffic (traffic hitting the CPU for one reason or another). When this happens, try consoling into the switch (Telnet/SSH will be unresponsive). Issue the command 'show process cpu' and see if that sheds some light.
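For reference, the top of that output looks roughly like this (the numbers below are invented, purely to show where to look):

Switch#show processes cpu
CPU utilization for five seconds: 99%/97%; one minute: 98%; five minutes: 97%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
...

The figure after the slash (97% in this made-up sample) is the time spent at interrupt level, i.e. switching packets in software. If it accounts for most of the total, the CPU is being consumed by traffic rather than by a runaway process.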

It may not be a "race" condition as such, then. An example of a race condition that immediately popped into my head is when BGP neighbor IP address reachability relies on BGP routing itself, which causes the BGP session to come up for a few seconds before being torn down again, then retried, and so on, causing BGP to flap. A race condition affecting hardware typically leaves the platform stuck in a reboot loop.
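A rough sketch of that scenario in config terms (the addresses and AS number are invented for illustration):

router bgp 65001
 neighbor 192.0.2.2 remote-as 65001
 neighbor 192.0.2.2 update-source Loopback0
! If the only route to 192.0.2.2 is one learned over this very session, the
! session establishes over some initial path, BGP then installs a route to the
! neighbor that depends on the session staying up, forwarding to the neighbor
! breaks, keepalives are lost, the session drops, the BGP route is withdrawn,
! the initial path returns, the session re-establishes, and the flap repeats.

The usual cure is to make sure reachability to the neighbor address comes from an IGP or a static route that doesn't depend on BGP.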

Now, back to your problem... You said you couldn't console into it; how about Telnet? If I'm not mistaken, the console actually uses more CPU cycles than Telnet. So, if your problem is related to high CPU utilisation as the previous poster said, the console may not work but Telnet *might* (*might* being the operative word here!). If you can telnet to it, then try issuing the "show processes cpu" command as suggested before. Failing that, I think you need to log a case with TAC.

It's the other way around here if the high CPU is interrupt-triggered. The influx of packets oversubscribes the link to the ASIC on the 6k that sits between the supervisor port ASIC and the RP CPU, and the packets get dropped there, so very few Telnet packets will get through.

We always recommend consoling into the device during a high CPU condition.

michaelchoo
Level 1

BTW, to answer your question about the Cat6500 being more prone to race conditions: the answer is "no". It's as prone as any other switch platform that Cisco has. In fact, in my opinion, the Cat6500 is probably the most reliable platform in Cisco's switching product portfolio. I've used many of them and haven't encountered any major issues... yet.

Yes, the switch was in a high-CPU state. An engineer was able to SSH into the console, reported that responses were very, very slow, and was then kicked out/disconnected. Thereafter no one was able to SSH in.

What are the likely causes or common reasons for high-CPU conditions in the switch? What kinds of admin commands/mistakes could cause a high-CPU condition? (The switch contains fibre and copper cards.)

thanks -M

There can be many things that cause high CPU utilization. Some of the more common culprits that I've come across are:

- Process Switching on one or more of the highly utilized interfaces.

- Excessive Spanning Tree topology changes (may be due to misconfigured Spanning Tree)

- Excessive routing topology changes, possibly due to mutual redistribution that causes a partial route feedback loop requiring constant route recalculation

I think your best course of action is to log a case with TAC, supplying them with the "show tech-support" output if you can capture it.
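If you want to try narrowing it down before (or while) the TAC case is open, and you can get onto the box, here are a few commands worth running; exact output varies by release, so treat these as a starting point rather than a recipe:

- show processes cpu sorted (drop "sorted" if your release doesn't support it) to find the top consumers and the interrupt percentage
- show interfaces stats to see per-interface packet counts by switching path ("Processor" means process-switched)
- show spanning-tree detail | include occurr to see topology change counts and how recently the last change happened
- show ip route summary a few times in a row; big swings in the totals suggest routing churn, e.g. from redistribution feedback
- show logging for anything the switch recorded around the time traffic stopped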

Edit: Do you have backup configs from both primary and secondary switches? If you do, you might want to compare the two and see if there are any differences. Also, is there any difference in hardware?
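If you have the configs saved off-box as text files (the file names below are just placeholders), an ordinary diff is the quickest way to spot differences:

diff -u core-a-config.txt core-b-config.txt

For the hardware question, "show module" on each chassis lists the installed cards with their hardware and firmware versions, so you can compare them side by side.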

Thank you very much, this is very helpful. I will act on these recommendations.

If you think of anything else, please don't hesitate to post an update.

Not a problem. Happy to help.

If you've solved the issue, can you please post an update so that we can all learn from your experience too? Thx
