Over the past 3 months, we have encountered switching loops twice in our network, bringing down the whole system, and we cannot trace the cause. Our core runs two 6500s with Sup2 (hybrid), and all the distribution/access switches are 4500s plus one 3750 stack (4 switches). All distribution/access switches are connected to both 6500s for redundancy, and the core switches have an EtherChannel trunk between them on ports Gi1/1 and Gi1/2. Two months ago, during the first incident, the GBIC ports on both core switches randomly transitioned to the UDLD err-disabled state. Fortunately we are running the UDLD err-disable recovery feature. That happened the day after I installed the 3750 stack into the network.
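For reference, the UDLD err-disable recovery feature mentioned above is typically enabled on IOS along these lines (a minimal sketch; the interval shown is the IOS default, not necessarily what was configured here):

```
! Enable UDLD globally on fiber interfaces (normal mode)
udld enable
! Automatically recover ports that UDLD put into err-disabled state
errdisable recovery cause udld
! Recovery timer in seconds (300 is the IOS default)
errdisable recovery interval 300
```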
Now the network's behavior is abnormal: the distribution/access switches randomly transition into loop guard or UDLD err-disabled state.
When I talked to our MA vendor, the engineer mentioned that she had handled an earlier incident where a stacked set of 3750s brought down a network.
Has any of you experienced this? Thanks.
The UDLD errors are seen because of a unidirectional link, usually on fiber links or due to bad GBICs. The switch whose transmit interface did not fail is placed into an err-disabled state, so you should check the interfaces where you are receiving these messages for bad fiber patch cords or GBICs. When loop guard is enabled and a port stops receiving BPDUs, the interface is put into the loop-inconsistent state. Troubleshoot those interfaces first. Also, since you have one 3750 stack: if the stack has a redundant uplink, remove one for the time being and concentrate on the UDLD and loop guard issues.
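To pinpoint which ports are affected, a few IOS show commands are useful (a quick checklist, not specific to this network):

```
! Ports currently err-disabled and the cause
show interfaces status err-disabled
! Per-port UDLD neighbor and state information
show udld
! Ports held down by loop guard (loop-inconsistent)
show spanning-tree inconsistentports
! Which err-disable causes have auto-recovery enabled
show errdisable recovery
```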
Pls rate useful posts.
Thanks for the reply. The UDLD events are random. We have tons of 4500 switches and we cannot trace which one is actually defective. The loop guard events are also very random.
A topology diagram would help. Also, is this issue specific to the 4500 switches, or is it seen on the 6500s as well?
What is the current IOS version on the 4500s, and which Sup are they running?
All the 4500 switches and the 3750 stack have connections to both core switches. Core1 is the STP root and HSRP active; Core2 is the secondary root and HSRP standby.
We see UDLD events on all the switches, so we believe this is not a fiber backbone issue, because the affected switches are scattered across different floors. From the MSFC of Core1, we see several duplicate-IP messages, but the "duplicate" is the router itself and the MAC address is the HSRP virtual MAC. That points to a spanning-tree loop. But which switch is causing it? It has already happened twice and we can't afford another one. Thanks, man.
As per the topology, the EtherChannel is the only link that completes a ring for all the other switches. Could you post the items below:
> running config for etherchannel on both cores including member ports
> show portchannel or show etherchannel output
> show cam agingtime on both core switches
> show cam
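Since the 6500 Sup2s run hybrid, the supervisor side of those outputs would come from CatOS rather than IOS; roughly like this (syntax from memory, and `show cam` needs a qualifier such as `dynamic`):

```
Console> (enable) show port channel
Console> (enable) show cam agingtime
Console> (enable) show cam dynamic
```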
The running-config of the core has been the same for the past 5 years, up until I replaced the old 4500 switches with new 4500s that support PoE. The aging time is the default, 300 seconds. The port-channel is currently shut down to break the ring, which I think is not a good idea, since my colleagues missed the chance to troubleshoot the main issue.
I don't think the UDLD err-disable recovery helped much. Analyzing what happened, I believe the recovery feature actually added insult to injury. I had configured the recovery interval down to 30 seconds because we run VoIP across the network and wanted fast recovery on the trunk ports. But from what I can see, it affected the STP calculation dramatically. The core runs legacy STP (802.1D) and the distribution switches run RSTP, but as far as I remember from theory, RSTP falls back to 802.1D behavior on ports facing the core to stay backward compatible. My suspicion is that before the network finishes converging, another downed trunk comes back up and forces yet another recalculation of the loop-free path. I think I should bump the recovery timer up a bit.
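If that theory holds, the fix sketched above is a one-liner: raise the recovery interval so it comfortably exceeds 802.1D convergence, which with default timers is roughly 30 to 50 seconds. The value below is just the IOS default, shown as an illustration:

```
! Auto-recovery timer well past worst-case STP convergence
errdisable recovery interval 300
```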
I think the loop issue is directly related to the UDLD events you encountered in the network. When one of the links detects a UDLD condition, it basically means that one of the receivers is not getting any traffic from the other end. From a spanning-tree perspective, that link is down and the topology has to be recalculated. In the process, the device reporting the link as down (the receiver getting nothing) will treat the link as dead in its spanning-tree calculations, but since the link is not actually completely dead, a loop can effectively form. So, as was suggested earlier, I think you should investigate the fiber connection between the two core switches and make sure it is good. If you have a repeater or patch panel in between, I would suggest removing them and then monitoring the links.
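One option worth considering (assuming the platforms support it, which these do) is UDLD aggressive mode: it err-disables the port outright when the condition is detected, rather than leaving spanning tree to cope with a half-dead link. The per-interface form looks like this; the interface name is illustrative:

```
interface GigabitEthernet1/1
 ! Err-disable the port on unidirectional detection instead of
 ! letting STP recalculate around a link that is only half up
 udld port aggressive
```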
As for your question about the stack interfering with the network, I do not think that is the case here. The stack would interfere only if, for some reason, the stacking broke and each switch started acting independently. But that is a very rare scenario, and in that case the communication link between the individual switches (the stack link) usually breaks as well, most of the time. If that is what you are observing, then we can investigate from the 3750 stack perspective. If not, I would suggest looking at the fiber links first.
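To rule the stack in or out quickly, the 3750 can report its stack health directly (standard 3750 IOS commands; output will vary):

```
! Members, roles (master/member), and stack state
show switch
! Status of the physical stack ring ports (Ok/Down)
show switch stack-ports
```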