
postmortem after STP loop

axfalk
Level 1

We're in a switched environment with two Cat-6506's on the core and four stackable Cat-2980G's daisy-chained on each floor. We have a total of 7 floors; the first Cat-2980G is fiber patched into the first Cat-6506 and the last Cat-2980G is patched into the other Cat-6506. A couple of days ago we sustained what looked like an STP loop, where the utilization on both core switches shot up to 100% and nobody could log in. We had to reboot the core switches and the situation stabilized. Presently, we're attempting to find the root cause of this problem. The Cisco engineer claims that because we rebooted the core switches, all info pertinent to the loop was lost and he can't really do much. Is that the case? Is there a way to track down what caused the loop?

Thanks.

14 Replies

axfalk
Level 1

Or maybe someone has also heard of bad sups on Cat-6506's.

Thanks again

Craig Norborg
Level 4

Not bad sups, but bad OS revisions. Have you looked into upgrading to one of the latest revisions, on both the 6500's and the 2980's?

Yes, like most problems, an STP loop is very difficult to debug when it is no longer occurring. You could try mapping out the complete spanning tree for your network, but that can take quite a while. Have you had your switches logging to an external syslog server? I would highly recommend doing this; next time you might get some messages that indicate where the failure is.
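
If you haven't already, pointing the switches at a syslog box is only a few CatOS commands. A rough sketch (the server address is just a placeholder, and it's worth checking the exact syntax against your code revision):

set logging server 10.1.1.50
set logging server enable
set logging server severity 6
set logging timestamp enable

That sends everything up to severity 6 (informational) off-box with timestamps, so you still have a record after a reboot.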

Also, should it happen again, try plugging into one of the core switches to diagnose the problem. Or, instead of rebooting the cores, try taking down each stack of switches one at a time to help figure out where the loop might be.

Not a lot you can do right now, but be prepared for the next time it might happen... You never know; someone might have plugged in a small switch without you knowing it, figured out it was causing the problem, and disconnected it while you were rebooting your core switches. You can prevent this from happening by turning on BPDU guard globally.
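
If it helps, on recent CatOS releases the global knob is a single command; a sketch:

set spantree portfast bpdu-guard enable

With that on, any portfast-enabled access port that receives a BPDU gets errdisabled, so a rogue switch shuts down its own port instead of joining your spanning tree. On IOS-based switches the equivalent is spanning-tree portfast bpduguard default.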

Craig, thanks for your response.

We do have an external syslog server, Cisco Works 2K; however, the only error logs it had came from the floor switches - none came from the core. The errors were saying that the "break" switch (the one that has one port blocked) was learning the same MAC address from two upstream (core) switches, which basically indicated a loop. Also, with regard to the logs, if both core switches were incapacitated to the point that they could not send logs to an external syslog server, are we better off keeping the logs locally?

Do you also think that an external switch could have caused such a massive STP loop?

Thanks again for your help.

Have you tracked down the MAC address they were seeing as duplicated? If you do that, you might have an indication as to where to start looking for a loop.
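
If you still have it from the syslog messages, a quick way to chase it on CatOS is to look the address up in the CAM table on each switch; a sketch (the MAC below is obviously just an example - substitute the one from your logs):

show cam 00-10-7b-aa-bb-cc

That tells you which port the address is being learned on; repeat it switch by switch toward the edge and you end up at the physical port closest to the source.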

As for "keeping the logs locally", the only way to do that is to keep the switch up. Once you reboot it or cycle the power, your logging history is gone. Yes, there might be things in that log that didn't make it to the syslog server, so in order to see that, if it happens again - you will need to keep the switches up.

Then you're back to having a plan for what to do should this happen again, such as unplugging the remote switch stacks one by one. If you know what MAC address might have been causing it, that could give you an indication of where to start, though...

As for whether an external switch could cause such a loop? Yep, it sure could.

One thing I didn't think to ask, do you have your core switches set to a lower STP priority in order to force them as your roots? Might be a good idea.
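
For reference, on CatOS you can either set the bridge priority directly or use the root macro; a sketch (the VLAN range is just an example):

set spantree root 1-100
set spantree root secondary 1-100

Run the first on the primary 6506 and the second on the other one, so the root stays predictable even if a new switch shows up with a low bridge ID.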

Greg, again appreciate your response.

The very first "culprit" MAC address was that of the MSFC in the root bridge. I say "the first" because, once the floodgates were opened, every "break" switch on each floor was reporting a different MAC address (Host xx:xx:xx.. is flapping between 2 ports). So I initially thought that the MSFC began to race and brought down the Cat-6506. However, that hypothesis did not pan out. The strange thing is that whatever caused this massive network outage got "resolved" by bouncing the switches, which pretty much rules out the config as the culprit.

Yes, the core switches are set to a lower STP priority.

And, speaking of the logs, would it be a good idea to have logs both locally (up to sev level 4) and on the external syslog server (up to sev level 6, as port changes to a different state - learning, blocked, etc. - are trapped as sev level 6)?
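
Roughly what I had in mind on the CatOS side (syntax from memory, so worth double-checking, and I am not sure the local buffer can be filtered to a different severity than the server):

set logging buffer 500
set logging level spantree 6 default
set logging server severity 6

i.e. a bigger local history buffer, the spantree facility logged at severity 6 so the port state changes are captured, and everything up to severity 6 forwarded to the syslog server.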

Thanks again

Well, it kind of depends on how critical things are as to how much you want to log. If it was a one-time problem, logging a lot of messages tends to get messy, and you might miss important messages about other things buried in there. Logging can also drive up the CPU of a system if you're logging too much. I would say enable as much logging as you're willing to go through, or try to develop some automated method of weeding out the chaff. If you're going to ignore the logs because there are too many of them, I would say reduce the logging level to the point where you are willing to look at them.

CiscoWorks does a great job of classifying regular logging, but it doesn't do a good job with debugging output. You might look for other automated tools before you increase the logging too much.

I would say developing a plan of what steps to take the next time this occurs (if it does) would be more important than logging a ton of information that might never be looked at or used, but it's really a judgment call...

Hi, I have seen this before, and it's very hard to trace the problem once you have rebooted. Craig is right about the syslog, etc.

Normally there has to have been a change in your network for an STP loop to appear. They don't normally pop out of thin air.

Do you have a change control system, i.e. somewhere engineers keep track of work that's going on? If not, what was happening on that day?

It would only take a badly patched switch to start an STP loop. Even a small switch, placed in the right place, could do this.

Also, do you run any other makes of switches? I have seen Extreme switches crash Cisco networks, as their default STP is different from Cisco's.

Hope that helped

I know this is late in coming, but do you have HSRP running between the core switch MSFCs? Are you then trunking between the switches? If that trunk were to die, then both MSFCs would go active for each of the VLANs configured for HSRP. What you would see at the user switches would be duplicate MAC addresses from the MSFCs.
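
To make that concrete, with per-VLAN HSRP on the MSFCs the relevant config is along these lines (addresses, VLAN and group numbers made up for illustration):

! MSFC-A; MSFC-B mirrors this with a lower standby priority
interface Vlan5
 ip address 10.5.0.2 255.255.255.0
 standby 5 ip 10.5.0.1
 standby 5 priority 110
 standby 5 preempt

If the trunk carrying VLAN 5 between the cores goes away, each MSFC stops hearing the other's hellos and both become active, both sourcing the same HSRP virtual MAC - which is exactly the kind of duplicate/flapping MAC the access switches would report. A show standby brief on both MSFCs during an event would confirm an active/active condition.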

Not sure it applies, but consider workstations with wireless *and* wired connectivity as an STP loop source. An edge switch might complain to syslog about seeing 2 paths to your MSFC.

Steve,

Thanks for your response. We do not have wireless LAN here. How could a wired workstation be an STP loop source?

Steve,

Thanks for your response.

As a matter of fact, we do have HSRP running between the core switch MSFCs, and we are running an EtherChannel between the two core Cat-6506 switches. We do not have UDLD configured on that EtherChannel (not yet, anyway), and the possibility of the EtherChannel going down and creating the havoc crossed our minds. However, we still can't figure out the root cause.
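
For the UDLD piece, what I believe we would add on the CatOS side is something like this (module/port numbers are placeholders for wherever the channel actually lands, and the per-port form may vary slightly by CatOS release):

set udld enable
set udld enable 1/1
set udld enable 1/2

i.e. turn UDLD on globally and then on each fiber port in the channel, so a unidirectional link gets errdisabled instead of silently breaking spanning tree.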

Can somebody shed more light on HSRP and EtherChannel trunks? I noticed a similar issue during lab testing. It wasn't an STP loop, but the switches generated so many log messages during convergence that it pretty much bogged down the network. Is there a Cisco document explaining why such behavior happens?

The best way to check if HSRP is your culprit is to do a show standby on each MSFC; there should only be one active router for each VLAN. If you open two telnet windows side by side, you should be able to compare them.

For each VLAN, one MSFC should be in active mode and should identify the other MSFC as standby.

See this link for more detailed information:

http://www.cisco.com/en/US/tech/tk648/tk362/technologies_tech_note09186a0080094afd.shtml

Not sure who or where that wireless/wired comment came from but it wasn't me.

I have worked with similar switch configurations and have found VLAN 1 to be problematic. To resolve this issue I prune VLAN 1 from the trunks that connect each stack to the two 6500's and from the EtherChannel link between the two 6500's. It may not be possible to prune VLAN 1 from the stacks of 2980's, but that's OK, just as long as you prune VLAN 1 from the 6500's. For example, if a stack has VLAN 5 as its native VLAN, then the 6500 trunk statement may look like this:

set trunk 5/3 on dot1q 5,14

with a corresponding clear statement:

clear trunk 5/3 1-4,6-13,15-1005

Notice that I did not include VLAN 1 in the set trunk statement. It's OK to trunk multiple VLAN's (here I'm trunking VLANs 5 and 14, but not 1).

By pruning VLAN 1 from the trunks my networks do not experience as many spanning tree runaway conditions as before when I included VLAN 1 in my trunk statements.
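
For completeness, the same idea applied to the EtherChannel between the two 6500's is just the clear on each member port; a sketch, assuming purely for illustration that the channel is on ports 1/1-2:

clear trunk 1/1 1
clear trunk 1/2 1

CatOS wants the trunk configuration to match on every port in the bundle, so apply the same clear to each member.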

Also, as another person recommended, you should diagram your network and then go to each switch and show the spanning-tree link costs. Map out all these link costs and look for the pattern exceptions; i.e., where do your links not follow the general pattern? Determine why certain link costs are out of line and then rectify.
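
On CatOS the per-VLAN role, cost, and state of each port comes from show spantree, so the walk is basically this on every switch (VLAN 5 is just an example):

show spantree summary
show spantree 5

Note the root port, designated ports, and path costs per switch, and the tree more or less draws itself; any cost that does not match the link speed you expect is worth a second look.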