Loop brought down whole network

rmujeeb81 · ‎09-05-2009

Hi,

We had an outage at one of our enterprise customers a few days ago. After receiving the complaint we captured traffic on the network and found that ARP broadcasts made up as much as 40% of the total traffic captured, and the whole network was almost down at that time. Eventually we found a D-Link hub on which an end user had plugged an Ethernet cable back into the same hub, creating a loop in the network. We removed the cable and the network returned to normal. My question is: how can we prevent such loops even when STP is running on the access layer switches these hubs are connected to? The hubs are unmanaged, so they don't send BPDUs. What would be a possible solution to prevent loops caused by end users?

Another question: if the campus network is flat, meaning VLAN 1 is used all over the campus, would creating multiple VLANs limit the scope of such a loop to the affected VLAN only?

Thanks

mahmoodmkl · ‎09-05-2009

Hi,

I'm not sure about the first question, but yes, if you had used VLANs, the effect of the loop would have been limited to that VLAN.

Thanks

Mahmood

Peter Paluch · ‎09-05-2009

Hello Mujeeb,

Was that a hub or a switch? Either way, the best solution would probably be to replace all hubs and unmanaged switches in your customer's network with managed switches. However, I am not sure whether this can be arranged.

If the loop cannot be prevented, at least its effects should be minimized. I suggest reading this URL - it is a Catalyst 2960 guide about traffic storm control. In short, it allows you to define per-port thresholds separating normal from abnormal amounts of unicast, multicast and broadcast traffic. If a threshold is exceeded, the port can drop the excess traffic or it can be automatically err-disabled (automatic recovery can also be configured):

http://cisco.com/en/US/docs/switches/lan/catalyst2960/software/release/12.2_50_se/configuration/guide/swtrafc.html#wp1063295
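As a rough sketch of what that guide describes, assuming a Catalyst 2960 access port: the interface name, the thresholds and the recovery interval below are only illustrative assumptions, not tuned recommendations, so check them against your software release and normal traffic levels.

```
! Hypothetical access port facing the user's hub
interface FastEthernet0/4
 ! Block broadcasts above 5% of port bandwidth, resume below 2%
 storm-control broadcast level 5.00 2.00
 ! Block multicasts above 10% of port bandwidth
 storm-control multicast level 10.00
 ! Err-disable the port instead of only dropping the excess traffic
 storm-control action shutdown
!
! Optional: bring err-disabled ports back automatically after 300 seconds
errdisable recovery cause storm-control
errdisable recovery interval 300
```

Without the storm-control action line the port simply filters the excess traffic, which dampens a storm but leaves the looped segment attached.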

I am somewhat surprised that the Catalyst switch connected to this looped part of the network did not err-disable its port after receiving its own LOOP (keepalive) frames looped back. However, it is possible that they were lost in the storm.

Creating multiple VLANs would contain a broadcast storm within the VLAN of its origin, so it would indeed not spread to stations in other VLANs. However, a broadcast storm, even in a single VLAN between two switches, can be so bandwidth-intensive that it effectively knocks out a switch regardless of the size of the VLAN. The switch is then affected as a whole, not just in a single VLAN.

Best regards,

Peter

rmujeeb81 · ‎09-05-2009

Hi,

That was a D-Link hub, which is in fact connected to an HP 2524 switch, and STP is running on the HP 2524. HP switches have a 'loop-detect' feature, but it is not available on the currently installed switches. I have configured the broadcast-limit feature, which is similar to the storm-control feature on Cisco, but I have read on various tech forums that 'broadcast-limit' is not very effective against this kind of loop. Another query, regarding dividing the campus network into different VLANs, which is of course a recommended design: let's say an end user creates a loop, that user is in a particular VLAN, and this VLAN extends up to the distribution switch or collapsed core. If a loop occurs and excessive broadcast traffic is generated within this VLAN, it will ultimately reach the core/distribution switch, which could then become unresponsive to legitimate traffic because of the high CPU utilization caused by the broadcasts in that VLAN. What is the best practice for preventing such a situation in general?

Regards,

Mujeeb

Peter Paluch · ‎09-05-2009

Hello Mujeeb,

Yes, I admit - traffic storm control does not prevent the storm from occurring. It only provides a method of dampening it and cannot completely eliminate it. If you have a switched network, then the only way to prevent switching loops is to run STP on all switches and to watch very closely for users attaching unauthorized dumb hubs or switches.

Is there anything similar to BPDU Guard on the HP switch? I assume that the loop was created on a D-Link hub somewhere in a user's office, which means that the hub was connected to an access port on the HP. Access ports should not receive BPDUs because only PCs should be connected there. If a BPDU is received on such a port, the port should be disabled, because a BPDU arriving on an access port indicates either an unauthorized expansion of the network, or a loop (or both). BPDU Guard on Catalyst switches err-disables a port if a BPDU is received on it. If the HP supports something like it, it would perhaps be helpful here as well.
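For reference, a minimal sketch of BPDU Guard on a Catalyst access port. The interface name is a hypothetical example, and I cannot say what the equivalent command on the HP 2524 would be, if there is one.

```
! Hypothetical user-facing access port
interface FastEthernet0/4
 switchport mode access
 ! Skip the listening/learning delay on ports where only end hosts belong
 spanning-tree portfast
 ! Err-disable the port as soon as any BPDU arrives on it
 spanning-tree bpduguard enable
!
! Optional: recover ports err-disabled by BPDU Guard automatically
errdisable recovery cause bpduguard
```

An unmanaged hub sends no BPDUs of its own, but a loop behind it should in principle repeat the switch's own BPDUs back to the same port, and that would be enough to trigger the guard, provided the BPDUs survive the storm.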

Regarding your question about the best practice: I believe you know about this, but let me write it down just to be sure. For a couple of years now, Cisco has been proposing the idea of so-called local VLANs, as opposed to end-to-end VLANs. There is no technical difference between the two; only the coverage differs. An end-to-end VLAN spans the whole campus or enterprise so that all stations that belong together, regardless of their location, can be in a single VLAN. A local VLAN is usually created for a floor in a building, or for a workgroup on a floor, and what is very important - it is bounded by the distribution switch and it never continues past it towards the core.

Having the network segmented in this manner perhaps requires more VLANs than the traditional design, but at the same time the entire network becomes much more manageable, traffic patterns are more predictable and the failure domains are smaller. As a consequence, if a storm is created in a VLAN, it may propagate up to the distribution switches, but as the VLAN does not span past the distribution layer, the remaining part of the network is not affected.

Ultimately, however, you cannot have a redundant switched topology without fully running STP. Any other approach just tries to minimize the negative consequences of a storm induced by Layer 2 loops.
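To illustrate the local VLAN idea, here is a sketch under assumed names: VLANs 10 and 20, the interface numbers and the IP address are all hypothetical. The access trunk carries only the VLANs local to that switch block, and the distribution uplink to the core is routed, so no VLAN can extend across it.

```
! Access switch uplink toward distribution: carry only the local VLANs
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk allowed vlan 10,20
!
! Distribution switch uplink toward core (Layer 3 switch assumed):
! a routed port, so Layer 2 broadcasts stop here
interface GigabitEthernet1/0/1
 no switchport
 ip address 10.0.0.1 255.255.255.252
```

With this layout a storm in VLAN 10 can still load the distribution switch itself, but it has no Layer 2 path towards the core or into other switch blocks.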

Best regards,

Peter

Leo Laohoo · ‎09-06-2009

Been there, done that. Early last year, one of my client's sites went down for 4 hours because the building's network went into a storm. We found the culprit to be an unmanaged switch plugged into two different VLANs. The business unit was too cheap to pay for an additional cable run. Nevertheless, the same thing happened a few months later in the same building, caused by a different business unit (on a different floor). From that time on, because the building was populated by developers, we told the client that all bets were off on the SLA for that particular building and that restoring any future network outages at the site would be "best effort". Our client had no choice but to accept our terms.

eyerOck2007 · ‎09-06-2009

Where STP, or especially RSTP, is involved, as it should be in a redundant layered network structure, and recurring loop problems are disabling the network with broadcast storms, it is useful to leave a laptop running a terminal program (with a large scroll-back buffer) connected to the switch console in a locked room. During an event you would see something like this:

00:22:21: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/10, changed state to down

00:22:26: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 46 addrs per min

00:23:27: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 66 addrs per min

00:24:27: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 93 addrs per min

00:25:26: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 93 addrs per min

00:26:25: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 96 addrs per min

00:27:26: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 91 addrs per min

00:28:25: %RTD-1-ADDR_FLAP: FastEthernet0/4 relearning 52 addrs per min

Notice from these actual logs of an incident that the messages arrive about once a minute, with a bridge somewhere beyond port 4 causing dozens of addresses to be relearned every minute.

This narrows down the offending port and, as far as I know, only shows up on the console. I found this out by accident while hooking up a Cisco switch to a mishmash of strange networked industrial nodes (equipment) with a poor Layer 1 infrastructure maintained by others. It resulted in a finger-pointing war that I initially lost, until I proved to them beyond doubt that some of the equipment hanging on the network needed a BIOS update.

There's an article on Cisco's site explaining the ADDR_FLAP error, which points quite clearly to a poorly connected or defective STP-enabled bridge or switch.

Once the BIOSes in the equipment were updated, the severe recurring problems stopped. That is one of the consequences of taking another department's word that a job was completed.

rmujeeb81 · ‎09-09-2009

Kindly clarify one more thing. Let's say there are multiple VLANs in the network and one of them is VLAN 10, for example. The SVI for VLAN 10 is on the distribution switch, which means the trunk between the distribution switch and the access layer switches has to allow VLAN 10. What would happen if a broadcast storm were generated within VLAN 10? Would it reach the distribution switch as well?

If yes, then could it bring down the distribution switch through the high CPU utilization caused by excessive broadcasts?

Regards,

Jon Marshall · ‎09-09-2009

Mujeeb

Yes, a broadcast storm within VLAN 10 would indeed affect the distribution switch if the connection between the access layer and the distribution switch is an L2 trunk.

And yes it could well bring down the distribution switch.

Jon

rmujeeb81 · ‎09-09-2009

Hi Jon,

Can you recommend some proactive configuration guidelines that one should implement within a campus network to prevent such a situation?

Thanks.

Leo Laohoo · ‎09-09-2009

spanning-tree bpduguard enable on each access port and udld port aggressive on each fibre optic uplink.

We also have the following:

switchport port-security maximum 4

switchport port-security

switchport port-security aging time 2

switchport port-security violation restrict

switchport port-security aging type inactivity
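Put together, the above would look roughly like this on a switch. The interface names are made up for the example; the port-security values are the ones listed above.

```
! Hypothetical user-facing access port
interface FastEthernet0/4
 switchport mode access
 ! Allow at most 4 MAC addresses; drop frames from any further ones
 switchport port-security
 switchport port-security maximum 4
 switchport port-security violation restrict
 ! Forget learned addresses after 2 minutes of inactivity
 switchport port-security aging time 2
 switchport port-security aging type inactivity
 spanning-tree bpduguard enable
!
! Hypothetical fibre optic uplink
interface GigabitEthernet0/1
 udld port aggressive
```

The limit of 4 addresses leaves room for a PC plus a phone or similar, while a hub full of users quickly runs into the restrict action.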

Hope this helps.