Today a colleague connected a new switch to our network. It was a 3560 being connected to a pair of 3750s in a stack, so one uplink went to Gi1/0/18 and the other went to Gi2/0/18.
When the second cable (Cat6) was connected, a bunch of our switches went down. The uplink ports went into the err-disabled state and the error in the log was:
%ETHCNTR-3-LOOP_BACK_DETECTED: Loop-back detected on GigabitEthernet0/48.
%PM-4-ERR_DISABLE: loopback error detected on Gi0/48, putting Gi0/48 in err-disable state
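With around 10 switches affected, it can help to pull the err-disabled ports out of the collected logs rather than reading them by eye. Below is a minimal sketch that assumes the message format is exactly the one shown in the two log lines above. The function name `err_disabled_ports` and the regex are my own, not anything from Cisco.

```python
import re

# Assumption: the message layout matches the %PM-4-ERR_DISABLE line quoted above.
ERR_DISABLE_RE = re.compile(
    r"%PM-4-ERR_DISABLE: (?P<cause>.+?) error detected on (?P<port>\S+),"
)


def err_disabled_ports(log_text):
    """Return (port, cause) pairs for every err-disable message in the log text."""
    return [
        (match.group("port"), match.group("cause"))
        for match in ERR_DISABLE_RE.finditer(log_text)
    ]


# The two log lines from the post above.
sample_log = (
    "%ETHCNTR-3-LOOP_BACK_DETECTED: Loop-back detected on GigabitEthernet0/48.\n"
    "%PM-4-ERR_DISABLE: loopback error detected on Gi0/48, "
    "putting Gi0/48 in err-disable state\n"
)

if __name__ == "__main__":
    for port, cause in err_disabled_ports(sample_log):
        print(f"{port}: {cause}")
```

Running this over the logs from every affected switch gives a quick list of which ports tripped and for which cause.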
Now the expected behaviour was for spanning tree to just block the second connection to Gi2/0/18 and for no outage at all.
In the end about 10 switches went down with the loopback error.
So I have been researching the loopback error and found this link: http://www.cisco.com/application/pdf/paws/69980/errdisable_recovery.pdf
"A loopback error occurs when the keepalive packet is looped back to the port that sent the keepalive. The switch sends keepalives out all the interfaces by default. A device can loop the packets back to the source interface, which usually occurs because there is a logical loop in the network that the spanning tree has not blocked. The source interface receives the keepalive packet that it sent out, and
the switch disables the interface (errdisable). This message occurs because the keepalive packet is looped back to the port that sent the keepalive. Keepalives are sent on all interfaces by default in Cisco IOS Software Release 12.1EA-based
software. In Cisco IOS Software Release 12.2SE-based software and later, keepalives are not sent by default on fibre and uplink interfaces. For more information, refer to Cisco bug ID CSCea46385"
So I have a few questions.
1) What are these keepalives for?
2) Why was the same keepalive received back, and why were multiple other switches then put into err-disabled? Spanning tree should have blocked the second port instantly (RSTP).
3) The link mentions that "In 12.2SE-based software and later, keepalives are not sent by default on fibre and uplink interfaces". By uplink I assume it means a trunk port?
Our switches are running 12.2(25)SED1, so I'm not sure whether that version has the 'bug' mentioned in the article.
So can anyone shed some light on why this loopback error occurred? And are there any adverse side effects to applying the 'no errdisable detect cause loopback' command?
In short, I have seen this issue twice before:
1) A "hub" was connected to a switch via 2 cables, when portfast was configured.
2) The same as above, but the device connected was a vanilla-configured switch with no spanning-tree.
Not sure if you are really affected by the bug - I think you just need to review your config/topology.
In a situation like this, there is some luck involved as to whether you see your loopback or a BPDU first - and it depends on timing and the configuration of the devices being connected. Generally, a loopback is detected before STP has a chance to block the interface. If an interface is err-disabled, STP is no longer concerned with it.
This is what my theory was as well.
What do you guys think about disabling keepalives on trunk ports? Any disadvantages?
Absolutely a bad idea! This will fake the interface into being "UP" always, thus causing a black-hole for your traffic.
The keepalives are useful and protect the network from unforeseen topologies. Just as spanning-tree protects you from loops - don't disable that either (in a general sense, there are ethernet WAN situations where it can be done).
So why then are keepalives disabled by default on uplinks and fibre ports now?
And why does Cisco recommend disabling them in the links I have read? I have done some testing, and when the keepalives are disabled the interface does not stay 'up' always.
Good question. The change is a response to the large number of cases generated by confusion around this issue. I strongly discourage disabling keepalives unless there is a good reason (like having type 2 cabling). In desktop switches (2960, 3560, 3750, etc) the enhancement does disable the keepalives by default on uplinks and fiber interfaces - and at the same time doesn't fake link up. The faking of link up applies to the 6500 series - sorry for the confusion.
I will reiterate though: you need to find the problem with the topology, don't disable the protocols that stop the network from melting down.
Here is an excerpt from the original reasoning behind the default keepalive setting on desktop switches uplink and fiber ports:
Cisco supports Type 2 cabling, which does cause a loop in some situations. Switches are still used in environments where Type 2 cabling is used, so the keepalive / loopback detection mechanism cannot and will not be removed. If the keepalive / loopback detection mechanism is removed, a spanning-tree loop could occur, which is much worse than a port being errdisabled. (entire network going down vs. a single interface being errdisabled)
In addition, there have been many instances where the keepalive / loopback mechanism has prevented a bridging loop before spanning-tree could block the loop. Again, there is an advantage to errdisabling a single port instead of causing a network meltdown. (If you don't think this is useful, you probably don't see the benefit of UDLD Aggressive either, and you don't want me to start talking about how great a feature that is.)
Cisco has been convinced that there is less of a need to enable the keepalive / loopback mechanism on fiber and uplink ports because it is extremely unlikely that customers will use Type 2 cabling on network infrastructure ports. As a result, starting in 12.2SE, the fiber and uplink interfaces will no longer enable the keepalive / loopback mechanism by default.
The problem with the topology was introduced when a new switch with redundant links was connected. This caused 9 out of 50 switches to go into err-disabled as a result of the loopback error.
From what I have seen, disabling keepalives on Ethernet trunk ports does not affect the way STP works. Ports are still put into the blocking state when necessary. Disabling the keepalives does not affect BPDUs, which are what STP needs to determine the physical topology and make decisions.
What I'm trying to understand is why this actually happened. Why did some ports go into err-disabled and not all of them? How did these keepalives get out before STP blocked the redundant port?
We are using Cat6 on all our trunk ports, and our standard is to use the last two ports (Gi0/47 and Gi0/48) for the uplink to another switch (when fibre isn't used). I only want to disable keepalives on those trunk ports, since these are our uplinks. If Cisco is happy to disable keepalives by default on uplinks, then I really can't see any issue with manually disabling them on these ports only.
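For what it's worth, the change being described would be small. Here is a rough sketch that just renders the config text for the two uplink ports, assuming the interface-level `no keepalive` command is what gets applied. The helper name `render_no_keepalive` is made up for illustration, and the output is only text to review, nothing is pushed to a switch.

```python
def render_no_keepalive(interfaces):
    """Build IOS-style config lines that turn keepalives off per interface.

    Assumption: the interface-level `no keepalive` command is the intended
    change. This only produces text; it does not talk to any device.
    """
    lines = []
    for name in interfaces:
        lines.append(f"interface {name}")
        lines.append(" no keepalive")
    return "\n".join(lines)


if __name__ == "__main__":
    # The two uplink ports named in the post.
    print(render_no_keepalive(["GigabitEthernet0/47", "GigabitEthernet0/48"]))
```

Limiting the list to the uplink ports keeps loopback detection in place on the access ports, where a stray cable is most likely to be plugged in.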
My understanding of keepalives is that they are sent from and to the source interface's MAC address just as a test of the line, so they are supposed to be received by the interface that sent them. I believe that the error occurs when a keepalive from one interface is received on a different interface.
You are correct, the loopback detection will not affect STP. It can act as a second layer of protection, however, when STP is configured incorrectly. An example is connecting a switch that has BPDU filter enabled to a port that has portfast enabled without bpduguard.
I believe you see some ports go err-disable and not all of them *because* enough ports went err-disabled to stop the loop before it propagated entirely through the network.
Also, keepalives are at a lower level and cheaper for CPU to process than STP as it is a "dumb" protocol. If CPU is busy, you may have either STP or loopback detection kick in, whichever the CPU can service first. In a bridging loop scenario, there is a lot of chaos and it never happens the same way twice.
Feel free to disable keepalives, but I would not do it if I was not 100% sure what was going on in my network. As a troubleshooting step, it could teach us more about what might have gone wrong though. Are you going to be able to try to reproduce the problem for troubleshooting, or is there a network freeze now? Maybe we can go over the configs for the directly connected new device and existing 3750 stack.
On some Cisco switches the keepalives were disabled by default to reduce the number of cases opened by customers who saw both STP and loopback messages, since either mechanism can do the same job in many scenarios. It's a compromise really. I will always say it's prudent to use them because a) they don't hurt you and b) they can alert you to a problem. The only time I would say you should go forward and intentionally disable them long term is if your network is known to loop back frames due to cable type or other L1 constraints.
As far as keepalives - they are sent to/from the same address and are supposed to be received at the time they are sent. If another copy is received, that means there is an L1 loop, or L2 loop. The frame shouldn't go through another switch and back out where it came from unless a loop is present. Keepalives can be seen on other interfaces in a hub network, and they should be ignored by other devices or interfaces as the mac address does not belong to the far end.
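To make the "sent to/from the same address" point concrete, here is a minimal sketch of what such a frame looks like on the wire and how a port could recognise its own keepalive. Two things here are assumptions and not stated in this thread: the EtherType 0x9000 (the Ethernet loopback / Configuration Test Protocol type) and the zero padding. The function names are hypothetical.

```python
import struct

# Assumption: keepalives use the Ethernet loopback EtherType 0x9000.
LOOPBACK_ETHERTYPE = 0x9000


def mac_to_bytes(mac):
    """Convert a MAC written as 00:1a:2b:3c:4d:5e or 001a.2b3c.4d5e to 6 bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace(".", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return raw


def build_keepalive(interface_mac):
    """Build a frame whose destination and source are both the interface MAC."""
    mac = mac_to_bytes(interface_mac)
    header = mac + mac + struct.pack("!H", LOOPBACK_ETHERTYPE)
    payload = bytes(46)  # zero padding up to the minimum Ethernet payload size
    return header + payload


def is_own_keepalive(frame, interface_mac):
    """True if the frame is a loopback frame addressed from and to this MAC."""
    mac = mac_to_bytes(interface_mac)
    dst, src = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return dst == mac and src == mac and ethertype == LOOPBACK_ETHERTYPE


if __name__ == "__main__":
    frame = build_keepalive("00:1a:2b:3c:4d:5e")
    print(is_own_keepalive(frame, "001a.2b3c.4d5e"))
```

The check is the important part: a frame carrying the port's own MAC as both source and destination can only have come from that port, so seeing it arrive from the wire means something sent it back.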
Ah ok, what you said about the keepalives makes sense... The error is triggered when more than one copy of the same keepalive is received.
I have tried to reproduce the issue in a test lab but have been unable to. We don't have a 3750 stack in the test lab, just a few 3560s. We are planning to hook up the offending switch again once the keepalives have been disabled.
Just some info - the 3750 is the root bridge for almost all our VLANs in the campus. It has about 16 other switches hanging off it, but none with redundant links (soon to change). The whole campus has around 50 switches connected using a mix of fibre and Cat6. There are loops in the network for redundancy, which STP is taking care of without any issues.
The offending 3560 was connected to the 3750 first on Gi1/0/18, then on Gi2/0/18.
I have attached the configs from the 3750 and the 3560 if you would like to have a look. I don't think it was a config issue however
Thanks for your help so far!
>> As far as keepalives - they are sent to/from the same address and are supposed to be received at the time they are sent. If another copy is received, that means there is an L1 loop, or L2 loop. The frame shouldn't go through another switch and back out where it came from unless a loop is present. Keepalives can be seen on other interfaces in a hub network,
This makes sense and is what should happen, but I'm afraid that some implementations also react to the first copy of the keepalive, hence the bug.
In my humble opinion the explanation of this bug is totally wrong.
Ethernet keepalives are meant to be received back by the same device that has generated them.
They use a special encapsulation with SA = DA = interface MAC address.
What has probably changed is that they no longer react wrongly to a correct event.
Several colleagues have reported seeing these errors on switches over time.
In your case the effect has been so big that it might involve something else.
Hope to help
I also agree with you that the bug description from Cisco is confusing. Perhaps it means that the keepalives are looped back to a different port, i.e. a different port from the one that sent the keepalive?
My theory still is that a few of these keepalives got out before STP could block the second port. These keepalives were then received on a few ports that did not expect them, which created the err-disabled state.
This bug (CSCea46385) was actually a Junked bug because it was filed while the switch in question was going through a bridging loop. There is an enhancement we have been talking about in CSCdz72393 to disable the keepalives on uplinks and fiber interfaces to aid in avoiding confusion. It has nothing to do with networking, it's about documentation and hiding confusing output where it causes more trouble. I wouldn't worry about that bug too much, it's not very relevant. Also, I have not seen a bug where the switch erroneously interprets its loop replies.
The theory of these keepalives is doing my head in.
Does the interface that sends the keepalive expect to see the keepalive back or not?
From what I have read, keepalives are sent with the source and destination MAC address of the interface that sent them.
If this is the case then the interface should expect to receive the keepalive back.... right?
Now if a downstream switch does not have the MAC address of the source interface in its MAC table, it will flood the keepalive out all interfaces except the one it was received on. So if there is a loop in the network, it is this behaviour that could cause the source interface to receive the packet twice.... right?
I will take a look at the configs when I get a chance by the way - but generally they look okay.
The interface that sends the keepalive will definitely see it via its loopback once. However, it should not see it again.
If a downstream switch does not have the mac of the source and it receives this loop reply frame - it *now* does, since smac and dmac are the same.
If there is a loop in the network, mac addresses will flap very quickly as the packets storm around both directions.
Essentially the rules of sending a frame out all ports except the source interface cannot really be followed, since the mac address is in constant flux. What you get is frames being duplicated in the storm and sent out all directions. Eventually, you can see your own loop frame.
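The point about the downstream switch learning the MAC from the keepalive itself can be shown with a toy learning bridge. This is only a sketch of textbook learn-then-lookup behaviour, assuming learning happens before the destination lookup. It is not how any particular Cisco platform is implemented, and it deliberately does not model the storm case where the table is flapping. The class name `LearningSwitch` and the short MAC strings are made up.

```python
class LearningSwitch:
    """Toy transparent bridge: learn the source MAC, then look up the destination."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, src, dst, in_port):
        """Return the list of ports the frame would be sent out of."""
        # Learn (or re-learn) where the source lives before looking up dst.
        self.mac_table[src] = in_port
        out_port = self.mac_table.get(dst)
        if out_port is None:
            # Unknown destination: flood out every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination lives on the ingress port: filter (drop) the frame.
            return []
        return [out_port]


if __name__ == "__main__":
    sw = LearningSwitch([1, 2, 3])
    # Keepalive with smac == dmac arriving on port 1: learned, then filtered.
    print(sw.receive("aa", "aa", 1))
    # Ordinary unknown unicast arriving on port 2: flooded to the other ports.
    print(sw.receive("bb", "cc", 2))
```

In this steady-state model a keepalive is never forwarded, which is why a healthy downstream switch should not hand it back. Once the same MAC is being re-learned on different ports many times a second, the table entry no longer tells you anything reliable, and that is the situation described above.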