Large network outage caused by A-UDLD - Please Help.

Jun 26th, 2007

Hi,

Yesterday we had a strange outage. 45 Access layer switches (2950C's) all lost connectivity to the network. Not all switches terminate on the same distribution layer device (6500).

We had to visit every switch and discovered that the uplink port (fa0/25) had been err-disabled due to the following:

%ETHCNTR-3-LOOP_BACK_DETECTED

I have only recently enabled aggressive mode UDLD, and it is the only commonality between the affected switches, i.e. they all connect back to a native IOS 6500 with A-UDLD enabled globally.

All the access switches still had normal mode UDLD enabled. I was planning on changing them to aggressive at a later date.

I know that the loopback error message is supposed to be caused by a keepalive packet being looped back. Is there any chance that this condition could have been caused by the two different modes of UDLD interacting?

I cannot see any other reason why 45 switches in different locations would all lose connectivity to the network at the same time.

Any help is appreciated.

cbeswick Wed, 06/27/2007 - 23:57

Hi,

Thanks for your response. I would have agreed with you, and it is the first thing I checked for when investigating this. However, the distribution switch generated the following messages:

UDLD disabled interface Fa3/15, aggressive mode failure detected

udld error detected on Fa3/15, putting Fa3/15 in err-disable state

Attempting to recover from udld err-disable state on Fa3/15

At the same time, the access switch generated the following messages:

%ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on FastEthernet0/25.

%PM-4-ERR_DISABLE: loopback error detected on Fa0/25, putting Fa0/25 in err-disable state

This leads me to believe that not all is as it seems. I have serious misgivings about A-UDLD and how it interacts with devices configured only for standard UDLD. Perhaps it has something to do with the timing intervals that A-UDLD and standard UDLD use. As we know, keepalives use an "echo" mechanism similar to UDLD's, and perhaps the difference in UDLD timers is causing the keepalive to malfunction.

I am not convinced that this is purely a keepalive issue.
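To line up the two ends of a link, it can help to pull the err-disable cause and interface out of each log line. The following is only a minimal sketch: the helper name `err_disable_events` and the regular expression are my own assumptions, written against the message shapes quoted above, and other IOS releases may word these messages differently.

```python
import re

# Hypothetical helper: matches lines shaped like the err-disable messages
# quoted in this thread ("<cause> error detected on <intf>, putting ...").
ERR_DISABLE = re.compile(
    r"(?P<cause>\S+) error detected on (?P<intf>\S+), "
    r"putting \S+ in err-disable state",
    re.IGNORECASE,
)

def err_disable_events(lines):
    """Return (cause, interface) pairs for every err-disable log line."""
    events = []
    for line in lines:
        match = ERR_DISABLE.search(line)
        if match:
            events.append((match.group("cause").lower(), match.group("intf")))
    return events

# Log lines copied from the posts above.
distribution_log = [
    "UDLD disabled interface Fa3/15, aggressive mode failure detected",
    "udld error detected on Fa3/15, putting Fa3/15 in err-disable state",
]
access_log = [
    "%ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on FastEthernet0/25.",
    "%PM-4-ERR_DISABLE: loopback error detected on Fa0/25, putting Fa0/25 in err-disable state",
]

print("distribution:", err_disable_events(distribution_log))
print("access:", err_disable_events(access_log))
```

Run against the lines above, this reports a different cause on each end of the same link, which is the mismatch I am describing.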

Hi,

Yes, there is probably some relation between the two things. As far as I know, aggressive UDLD can work together with standard UDLD. The difference between the two is that A-UDLD disables the ports on both sides once 8 retries are exceeded, whereas the standard mode err-disables only one side of the link, but in general they use the same mechanism.

Hope it helps,

Krisztian

cbeswick Thu, 06/28/2007 - 01:04

Hi,

It would appear, then, that A-UDLD was doing as expected and actually err-disabled the link on the access switch. The only problem is that the access switch interpreted the error disable as a loopback problem, maybe because A-UDLD wasn't configured there and a different protocol was triggered instead.

This would also explain why our access switches didn't recover from the err-disable. We only had recovery configured for UDLD err-disables, not loopback, doh!
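The gap can be checked by comparing the err-disable causes seen in the logs against the causes that have recovery enabled. This is just a sketch under the assumptions stated in the comments: `causes_without_recovery` is a made-up helper, and the data simply mirrors our situation (recovery for UDLD only, loopback err-disables on 45 switches). The printed line follows the usual `errdisable recovery cause <cause>` global command form, which should be checked against the command reference for the platform in question.

```python
def causes_without_recovery(observed, recovery_enabled):
    """Return the observed err-disable causes that will not auto-recover."""
    return sorted(set(observed) - set(recovery_enabled))

# Assumption: mirrors this thread, not read from a real device.
recovery_enabled = {"udld"}      # causes with err-disable recovery configured
observed = ["loopback"] * 45     # one loopback err-disable per access switch

missing = causes_without_recovery(observed, recovery_enabled)
for cause in missing:
    # Candidate global config line for each uncovered cause.
    print(f"errdisable recovery cause {cause}")
```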

As such we have to visit all 45 switches and reset them.

bbaillie Thu, 06/28/2007 - 03:27

If you get a unidirectional link and an alternate bridging path exists, the blocking port will start forwarding, thus creating a loop. Yes, UDLD acted as designed, but for a brief time there was indeed a loop in the network, and the switches detected the loop and disabled their ports. Err-disable recovery was evidently not configured on the switches; only the default errdisable detect was on. Here is a link with more detail on UDLD and loop guard.

http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a0080094640.shtml

Errdisable is here.

http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00806cd87b.shtml

Cheers,

Brian

cbeswick Thu, 06/28/2007 - 03:55

Hi Brian,

Yes, I would agree if an alternate route existed. The access switches are not currently dual-homed to a resilient distribution layer; only a single fibre connection exists.

So you understand my situation. Something doesn't add up.

bbaillie Thu, 06/28/2007 - 04:25

Hmmmm, more than one event happened here. The switch determines whether a loop is present by watching for a keepalive that was sent by a port being received on the same port. This means there was indeed a loop in the network; the UDLD error may be a secondary event and not the one causing your outage, which was due to loopback detection. Without any redundant links in your network, my suspicion would be a human patching in a cable where they should not have. I and others have also seen switches, after a brief power outage or brownout, simply forward packets blindly regardless of which port received them (hub-like activity without populating the CAM table; switch management also goes down), causing loopback errors on other switches. Check for the possibility of a brief power problem causing a switch to go zombie. Check the event logs on the UPSes, if they are available; they may yield some answers.
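The keepalive check described above boils down to "did this port receive a frame it sent itself?". Here is a deliberately simplified model of that test; the `Keepalive` class, the function name, and the switch names are all hypothetical, and this is not how IOS implements it internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Keepalive:
    """Simplified keepalive: records which switch and port originated it."""
    origin_switch: str
    origin_port: str

def loopback_detected(local_switch, rx_port, frame):
    """True when a port receives a keepalive that it sent itself."""
    return frame.origin_switch == local_switch and frame.origin_port == rx_port

# Hypothetical names; Fa0/25 and Fa3/15 are the ports from this thread.
sent = Keepalive("access-sw-01", "Fa0/25")

# The access switch sees its own keepalive come back on the sending port.
print(loopback_detected("access-sw-01", "Fa0/25", sent))

# The distribution switch receiving the same frame is not a loopback.
print(loopback_detected("dist-6500", "Fa3/15", sent))
```

In this model, anything that reflects the frame back to its sender (a patching loop, or a misbehaving switch flooding frames out of the port they arrived on) trips the check.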

Cheers,

Brian

abdel_n Thu, 06/28/2007 - 08:01

Hi,

UDLD in aggressive mode can detect unidirectional problems and misconnected ports on point-to-point links (with no redundant connections) when one of these problems exists:

- One of the ports on the fiber-optic link cannot send or receive traffic.

- One of the ports on the fiber-optic link is down while the other is up. (Check whether the ports are up on the distribution side.)

- One of the fiber strands in the cable is disconnected.

As the common recent event for all access switches is "the aggressive UDLD configuration on distribution switches", try configuring UDLD back to normal mode and see what happens. At least you can then exclude a conflict between UDLD modes.

Please let us know if you have resolved the problem.

Abdel

cbeswick Thu, 06/28/2007 - 22:49

Hi,

Changing the distribution switches back to standard UDLD was the first thing I did before getting my team to visit all the comms rooms to reset the access switches.

Thanks for all your responses. But I do not think this is something I am going to get to the bottom of.

I know for a fact that no loops were created on the network via incorrect patching. There were no power outages, there was nothing whatsoever that can explain what happened. I was rather hoping that a Cisco engineer might be able to bring something new to the table.

The only other explanation is some kind of a software bug.
