Spanning-tree and network outages

Answered Question
Oct 20th, 2008

Looking for some troubleshooting feedback on this one.


In a single VTP domain of about 120 switches (mostly Cat 3560s, plus a few D-Link DES-3550s), I've recently started to see a few network-wide connectivity drops. They are very short, but unacceptable either way. Since everything is running the default PVST configuration, I wonder whether the short downtime is an STP recalculation followed by the network reconverging. We'll soon be moving Layer 3 out to each network closet, but in the meantime I want to find the culprit in the current setup. The syslogs aren't showing anything concrete, and I don't see any spanning-tree inconsistencies from my root bridge. Does anyone have a few tricks for tracking down these issues?
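As a rough yardstick for the STP theory, here is a minimal sketch of the IEEE 802.1D timer arithmetic. It assumes the default timers (max age 20 s, forward delay 15 s) and no UplinkFast/BackboneFast. A blocked port that takes over after a direct link failure spends one forward delay in listening and another in learning (about 30 s). After an indirect failure, the switch first waits for max age (about 50 s). A topology change does not block traffic. For max age + forward delay (35 s), however, switches age MAC entries out after forward delay instead of the usual 300 s, which causes flooding. The function names are illustrative only.

```python
# Legacy 802.1D / PVST reconvergence arithmetic with the default timers.
# Assumes no UplinkFast/BackboneFast and untouched timer values.
MAX_AGE = 20          # seconds a stored BPDU stays valid without being refreshed
FORWARD_DELAY = 15    # seconds spent in listening, and again in learning
NORMAL_AGING = 300    # default MAC address-table aging time in seconds


def direct_failure_seconds(forward_delay: int = FORWARD_DELAY) -> int:
    """Local link loss: the alternate port goes listening -> learning -> forwarding."""
    return 2 * forward_delay


def indirect_failure_seconds(max_age: int = MAX_AGE, forward_delay: int = FORWARD_DELAY) -> int:
    """Failure elsewhere: the stale root information must age out first."""
    return max_age + 2 * forward_delay


def topology_change_seconds(max_age: int = MAX_AGE, forward_delay: int = FORWARD_DELAY) -> int:
    """How long the root keeps the topology change (TC) flag set in its BPDUs."""
    return max_age + forward_delay


print("direct failure:", direct_failure_seconds(), "s")        # 30 s
print("indirect failure:", indirect_failure_seconds(), "s")    # 50 s
print("TC period:", topology_change_seconds(), "s")            # 35 s
print("MAC aging during TC:", FORWARD_DELAY, "s instead of", NORMAL_AGING, "s")
```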


Thanks in advance,


Jim

Istvan_Rabai Mon, 10/20/2008 - 08:55

Hi Jim,


On Cisco switches you can use the "debug spanning-tree" and/or "debug spanning-tree events" and similar debug commands to get more information on what is happening exactly.


"debug spanning-tree" has many options. Type "?" after the command to list them.


Cheers:

Istvan

jim-phillips Mon, 10/20/2008 - 11:19

I've used the "events" and "all" variations of the debug spanning-tree command before, but in a much smaller environment. I've also seen large-scale debugging cripple devices, so I'm somewhat reluctant to run it during production hours.

Istvan_Rabai Mon, 10/20/2008 - 11:32

Hi Jim,


Your point is right.


In your place, I would first try this debugging outside working hours, sending the debug output to a syslog server so you can examine it later.
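A minimal sketch of the collector side of this suggestion. It assumes the switches are pointed at it with "logging host <ip>" and "logging trap debugging" and send plain UDP syslog to port 514. The log file name and all function and class names are hypothetical. Each message is stamped on arrival so debug output from several switches can be lined up afterwards. parse_pri decodes the RFC 3164 "<PRI>" prefix (facility × 8 + severity), where severity 7 is debug and Cisco IOS uses facility local7 (23) by default.

```python
import socketserver
from datetime import datetime
from typing import Optional, Tuple

LOGFILE = "stp-debug.log"   # hypothetical output file
LISTEN = ("0.0.0.0", 514)   # standard syslog UDP port; binding it usually needs admin rights


def parse_pri(message: str) -> Optional[Tuple[int, int]]:
    """Decode a leading RFC 3164 '<PRI>' field into (facility, severity)."""
    end = message.find(">")
    if not message.startswith("<") or end < 2 or not message[1:end].isdigit():
        return None
    pri = int(message[1:end])
    return pri // 8, pri % 8


class SyslogToFile(socketserver.BaseRequestHandler):
    """Append each received datagram to LOGFILE with arrival time, sender IP and severity."""

    def handle(self) -> None:
        data = self.request[0].decode("utf-8", errors="replace").strip()
        pri = parse_pri(data)
        severity = pri[1] if pri else "?"
        with open(LOGFILE, "a", encoding="utf-8") as fh:
            fh.write(f"{datetime.now().isoformat()} {self.client_address[0]} sev={severity} {data}\n")


def run_collector(address: Tuple[str, int] = LISTEN) -> None:
    """Block forever, receiving syslog datagrams on `address`."""
    with socketserver.UDPServer(address, SyslogToFile) as server:
        server.serve_forever()
```

On the syslog host you would call run_collector() and then search the file for the STP debug lines around the time of an outage.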


If the problem happens only during working hours, you can debug on a switch that is less important or has fewer users (not on the root switch, for example).


But I would expect that "debug spanning-tree events" alone should not crash a switch, unless its processor is already overwhelmed by traffic or processes that you don't know about.


Cheers:

Istvan


jim-phillips Mon, 10/20/2008 - 11:44

Thanks,


I'll go ahead as planned after hours and post results.

paul.matthews Tue, 10/21/2008 - 02:11

Pick a couple of affected VLANs and run "sh spann vlan det" (show spanning-tree vlan <vlan-id> detail) for each. The output will include something like:

Number of topology changes 152 last change occurred 18:18:00 ago
        from GigabitEthernet1/1


If that time is close to the time since the last incident, spanning tree was affected, although it may or may not be the culprit! If you are looking at it 5 minutes after an incident and the last topology change was two months ago, then spanning tree is just fine.


This does depend a little on your network being correctly configured. If you have user ports that are not set as PortFast, then every time one of those ports comes up, a topology change is signalled to the root and, from there, to every switch in that VLAN.
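Paul's check (compare the "last change occurred" age with the time since the incident) can be written down as a small helper. This is a sketch under assumptions. The age is either hh:mm:ss, as in the outputs in this thread, or a compact form such as 1d02h or 8w4d, which is how IOS typically shows longer ages. The 5-minute tolerance is an arbitrary choice, and the function names are hypothetical.

```python
import re
from datetime import timedelta

# Units for compact ages such as "1d02h" or "8w4d" (assumed formats).
_UNITS = {"w": timedelta(weeks=1), "d": timedelta(days=1), "h": timedelta(hours=1)}


def parse_ago(text: str) -> timedelta:
    """Convert the age printed after 'last change occurred' into a timedelta."""
    text = text.strip()
    hms = re.fullmatch(r"(\d+):(\d{2}):(\d{2})", text)
    if hms:
        hours, minutes, seconds = (int(g) for g in hms.groups())
        return timedelta(hours=hours, minutes=minutes, seconds=seconds)
    parts = re.findall(r"(\d+)([wdh])", text)
    # Reject anything that is not made up entirely of number+unit tokens.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised age format: {text!r}")
    return sum((int(n) * _UNITS[u] for n, u in parts), timedelta())


def tc_matches_incident(last_change: str, since_incident: timedelta,
                        tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """True if the last topology change happened within `tolerance` of the incident."""
    return abs(parse_ago(last_change) - since_incident) <= tolerance


# Output taken roughly three hours after an outage.
print(tc_matches_incident("03:04:52", timedelta(hours=3)))   # True
# Paul's counter-example: 5 minutes after an incident, last change weeks ago.
print(tc_matches_incident("8w4d", timedelta(minutes=5)))     # False
```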

jim-phillips Wed, 10/22/2008 - 13:54

So another outage...

I started checking STP on individual VLANs.


I got similar results on numerous switches for one specific VLAN:


Example of "show spanning-tree vlan 201 detail" three hours after the outage.

#################################
VLAN0201 is executing the ieee compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 201, address 001f.260c.6480
Configured hello time 2, max age 20, forward delay 15
Current root has priority 32969, address 000a.b89b.3100
Root port is 1 (GigabitEthernet0/1), cost of root path is 8
Topology change flag not set, detected flag not set
Number of topology changes 122 last change occurred 03:04:52 ago
        from GigabitEthernet0/2
Times: hold 1, topology change 35, notification 2
        hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0, aging 300
#########################################
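To run the same check across many switches, it helps to pull the useful fields out of "show spanning-tree vlan <vlan-id> detail" automatically. The sketch below does that. The regular expressions are assumptions based only on the output posted above; other IOS releases may format it differently, and the names are hypothetical. split_priority also shows that the root's priority of 32969 is the default 32768 plus the extended system ID (VLAN 201). That suggests the root for this VLAN won the election on lowest MAC address rather than through a configured priority.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Abridged copy of the output posted above, used as sample input.
SAMPLE = """\
VLAN0201 is executing the ieee compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 201, address 001f.260c.6480
Current root has priority 32969, address 000a.b89b.3100
Root port is 1 (GigabitEthernet0/1), cost of root path is 8
Number of topology changes 122 last change occurred 03:04:52 ago
        from GigabitEthernet0/2
"""


@dataclass
class StpVlanDetail:
    root_priority: int
    root_address: str
    root_port: Optional[str]   # None when the switch is itself the root
    topology_changes: int
    last_change_ago: str       # kept as printed, e.g. "03:04:52"
    changed_from: str          # interface shown after "from"


def split_priority(advertised: int) -> Tuple[int, int]:
    """Split an extended-system-ID priority into (configured priority, VLAN ID)."""
    return advertised - advertised % 4096, advertised % 4096


def parse_stp_detail(text: str) -> StpVlanDetail:
    root = re.search(r"Current root has priority (\d+), address (\S+)", text)
    port = re.search(r"Root port is \d+ \(([^)]+)\)", text)
    tc = re.search(r"Number of topology changes (\d+) last change occurred (\S+) ago\s+from (\S+)", text)
    if root is None or tc is None:
        raise ValueError("output does not look like 'show spanning-tree vlan ... detail'")
    return StpVlanDetail(
        root_priority=int(root.group(1)),
        root_address=root.group(2),
        root_port=port.group(1) if port else None,
        topology_changes=int(tc.group(1)),
        last_change_ago=tc.group(2),
        changed_from=tc.group(3),
    )


detail = parse_stp_detail(SAMPLE)
print(detail)
print(split_priority(detail.root_priority))   # (32768, 201)
```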


The topology change lines up very closely with the outage for the majority of our switches. Not all were hit.


We also run Zenoss as our monitoring software, and it reported about 400 instances of this shortly after the outage:


##########################
Host 001a.a0bd.23c6 in vlan 201 is flapping between port Gi0/2 and port Gi0/1
Host 00e0.18ba.bc5d in vlan 201 is flapping between port Gi0/1 and port Fa0/4
##########################



Flaps happened between trunk links multiple times within the same five-minute span, all tied to a single VLAN.
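With around 400 of these messages, a quick tally shows which ports and hosts keep turning up. This is a minimal sketch, assuming the lines look exactly like the two posted above; the function name is hypothetical. Because those lines carry no switch name, identical port names from different switches would be merged, so in practice you would key the counts by switch as well. A port that appears in flaps for many different hosts is a reasonable place to start looking.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

FLAP_RE = re.compile(
    r"Host (?P<mac>[0-9a-f.]+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)


def tally_flaps(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count flap events per port and per MAC address; non-matching lines are skipped."""
    ports: Counter = Counter()
    macs: Counter = Counter()
    for line in lines:
        m = FLAP_RE.search(line)
        if not m:
            continue
        ports[m["a"]] += 1
        ports[m["b"]] += 1
        macs[m["mac"]] += 1
    return ports, macs


sample = [
    "Host 001a.a0bd.23c6 in vlan 201 is flapping between port Gi0/2 and port Gi0/1",
    "Host 00e0.18ba.bc5d in vlan 201 is flapping between port Gi0/1 and port Fa0/4",
]
ports, macs = tally_flaps(sample)
print(ports.most_common(3))
print(macs.most_common(3))
```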



Still at a loss as to what exactly is causing this. I'll be setting up STP event debugging tomorrow.



glen.grant Wed, 10/22/2008 - 14:51

This can be caused by someone looping a data cable between two different ports on the same switch, or by someone with one of those nice home routers hidden under their desk who then loops a cable between its ports. These are always the hardest to find. Are those ports user ports or uplinks? If they are user ports, I would take a very close look at what's on them, not just in the closet but also out on the floor. Track those MAC addresses and see which port they are actually on when the network is quiet.
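Tracking where a MAC address actually lives amounts to a walk. Look the address up on one switch; if it is learned on an uplink, repeat on the switch at the far end until it lands on a port with no switch behind it. Below is a minimal sketch. The switch names, ports, MAC address and uplink map are all hypothetical. On the switches themselves the lookups would come from "show mac address-table address <mac>" (or "show mac-address-table address <mac>" on older IOS releases).

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical quiet-time snapshots: switch -> {MAC -> port it is learned on}.
MAC_TABLES: Dict[str, Dict[str, str]] = {
    "core": {"0000.1111.2222": "Gi0/3"},
    "closet-a": {"0000.1111.2222": "Gi0/2"},
    "closet-b": {"0000.1111.2222": "Fa0/12"},
}
# Hypothetical uplink map (e.g. built from CDP): (switch, port) -> switch on the far end.
UPLINKS: Dict[Tuple[str, str], str] = {
    ("core", "Gi0/3"): "closet-a",
    ("closet-a", "Gi0/2"): "closet-b",
}


def locate_mac(mac: str, start: str,
               mac_tables: Dict[str, Dict[str, str]],
               uplinks: Dict[Tuple[str, str], str]) -> List[Tuple[str, str]]:
    """Follow the port a MAC is learned on, switch by switch, until it is not an uplink."""
    hops: List[Tuple[str, str]] = []
    visited: Set[str] = set()
    switch: Optional[str] = start
    while switch is not None and switch not in visited:  # guard against circular data
        visited.add(switch)
        port = mac_tables.get(switch, {}).get(mac)
        if port is None:
            break  # not learned here (aged out, or no snapshot for this switch)
        hops.append((switch, port))
        switch = uplinks.get((switch, port))  # None once the port has no switch behind it
    return hops


# The last hop is where the host appears to be plugged in.
print(locate_mac("0000.1111.2222", "core", MAC_TABLES, UPLINKS))
```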

Correct Answer
paul.matthews Thu, 10/23/2008 - 00:28

I agree you need to track where the address *should* be.


I would also suggest that, if you don't already use them, you use BPDU Guard and port security. Port security can be used to restrict a user port to a single MAC address; i.e. if a user puts a hub there to connect a second PC, only one will work. BPDU Guard should be used on all edge ports of your network: it protects your network against someone plugging in a real switch that will send BPDUs. The usual effect is that when someone plugs a switch in, the port shuts down and the user has to ask for it to be re-enabled, giving you the opportunity to educate them about the issues of connecting unauthorised devices to the network!
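To make these two behaviours concrete, here is a toy model in Python. It is not IOS configuration, and the class and its methods are hypothetical. It assumes one secure MAC per port. With the violation mode set to "protect", frames from a second MAC are simply dropped, which matches "only one will work". The IOS default port-security violation mode is "shutdown", which err-disables the whole port instead, just as BPDU Guard does when a BPDU arrives. The reason strings mirror the IOS err-disable cause names psecure-violation and bpduguard.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class EdgePort:
    """Toy model of an access port with port security and BPDU Guard (not IOS code)."""
    name: str
    max_macs: int = 1
    violation: str = "shutdown"   # "shutdown" (IOS default) or "protect"
    learned: Set[str] = field(default_factory=set)
    err_disabled: bool = False
    reason: str = ""

    def receive_frame(self, src_mac: str) -> bool:
        """Return True if a frame from src_mac would be forwarded."""
        if self.err_disabled:
            return False
        if src_mac in self.learned:
            return True
        if len(self.learned) < self.max_macs:
            self.learned.add(src_mac)   # first MAC(s) become the secure address(es)
            return True
        if self.violation == "shutdown":
            self._err_disable("psecure-violation")
        return False                    # "protect": drop only the extra MAC's traffic

    def receive_bpdu(self) -> None:
        """BPDU Guard: any BPDU on an edge port err-disables it."""
        self._err_disable("bpduguard")

    def reenable(self) -> None:
        """What the admin does after the conversation with the user."""
        self.err_disabled, self.reason = False, ""
        self.learned.clear()

    def _err_disable(self, reason: str) -> None:
        if not self.err_disabled:
            self.err_disabled, self.reason = True, reason


hub_port = EdgePort("Fa0/10", violation="protect")
hub_port.receive_frame("0000.0000.0001")   # first PC: forwarded
hub_port.receive_frame("0000.0000.0002")   # second PC behind the hub: dropped
hub_port.receive_bpdu()                    # someone adds a real switch
print(hub_port.err_disabled, hub_port.reason)   # True bpduguard
```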


Paul.
