Rapid Spanning Tree in Access Layer?

mfarrenkopf · 04-10-2012

My colleague and I have been having a discussion about using rapid spanning tree in the access layer. Most of our infrastructure has been migrated to a routed access layer with 3750s.

The idea was brought up to configure the switches with rapid PVST. On the surface, it seems like a better idea: faster convergence in the event that spanning tree ends up being used for some reason. My colleague prefers sticking with standard PVST. His argument is that, in the event of a layer 2 loop, some consumer-level switches filter out BPDUs and if the control plane is overwhelmed, the shorter timers of rapid PVST just put that much more of a burden on the CPU trying to regain control, whereas with standard PVST it will have around 20 seconds before it starts to engage. (It may still be overwhelmed, but the longer timer delays the additional burden.) He says he's seen this problem with rapid PVST and that his opinion is backed up by our Cisco rep. (I haven't spoken to him yet.)

In our model, it should be very rare -- pretty much never -- that we would layer 2 span another switch off of our access stack.

One suggestion I saw is to use BPDU Guard, which is a good suggestion as well.

But we have had experiences with overloading the control plane on a 3750. I believe that concern is valid. If the CPU can't service spanning tree, then all hell will break loose anyway. But I'm interested in hearing about other experiences people have had in terms of rapid spanning tree in the access layer, end users plugging in unauthorized devices and creating loops, and the effects when using rapid spanning tree vs standard spanning tree.

Peter Paluch · 04-10-2012

Hello,

My colleague prefers sticking with standard PVST. His argument is that, in the event of a layer 2 loop, some consumer-level switches filter out BPDUs and if the control plane is overwhelmed, the shorter timers of rapid PVST just put that much more of a burden on the CPU trying to regain control, whereas with standard PVST it will have around 20 seconds before it starts to engage.

To be honest, I find this statement highly questionable and I do not see the logic in it.

First of all, STP and RSTP are there to prevent switching loops, not to create or aggravate them. Both STP and RSTP are proven to prevent switching loops if properly deployed. If some switches filter out BPDUs, then nothing is going to prevent the switching loop from happening, and the mere fact that it appears sooner or later is not going to alleviate it in any way. After the switching loop is created, it does not make sense to talk about "CPU trying to regain control" as there is no such thing. If STP/RSTP works correctly, the switching loop will not ensue at all. And if STP/RSTP is not working properly, then after the switching loop is created (sooner or later), there is nothing the CPU is going to do in order to stop it, because the only way to stop it is to run STP/RSTP correctly, and according to the premise, it is not working.

RSTP convergence from Discarding to Forwarding state, if driven by timers, takes 30 seconds which is the same time as in legacy STP (no need to talk about per-VLAN semantics here, they do not make a difference to the point of this discussion). Running STP or RSTP in this aspect makes no difference whatsoever.

The 20 seconds you are mentioning probably refer to the STP max_age timer, which is not used in RSTP; rather, there is a timeout of 3x the Hello interval after which the neighboring switch is assumed to have failed. Note, however, that the max_age in STP is concerned with the following state transitions only:

from Blocking to Listening - this transition has no realistic impact on CPU because neither of these states allows for creation of a switching loop, as data frames are not forwarded in either of these states
from Root Forwarding to Designated Forwarding - note that in this case, the port is already Forwarding, so the transition is not going to change anything about the existing switching loop
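
As a rough worked example of the timer arithmetic above, here is a minimal Python sketch. It assumes the standard default timer values (Hello 2 s, Forward Delay 15 s, Max Age 20 s), all of which are configurable, and the function names are made up for illustration only:

```python
# Default spanning tree timer values in seconds (assumed; all configurable).
HELLO = 2
FORWARD_DELAY = 15
MAX_AGE = 20


def timer_driven_forwarding_delay(forward_delay=FORWARD_DELAY):
    """Time for a port to reach Forwarding when only timers drive it.

    Legacy STP: Listening -> Learning -> Forwarding.
    RSTP without Proposal/Agreement: Discarding -> Learning -> Forwarding.
    Either way the port waits one forward_delay in each of two steps.
    """
    return 2 * forward_delay


def stp_neighbor_timeout(max_age=MAX_AGE):
    """Legacy STP: stored BPDU information expires after max_age."""
    return max_age


def rstp_neighbor_timeout(hello=HELLO):
    """RSTP: the neighbor is assumed lost after 3 missed Hello intervals."""
    return 3 * hello


print(timer_driven_forwarding_delay())  # 30
print(stp_neighbor_timeout())           # 20
print(rstp_neighbor_timeout())          # 6
```

With these defaults, the timer-driven path to Forwarding is 30 seconds in both protocols; only the time to notice a lost neighbor differs (20 seconds versus 6 seconds).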

If I am missing any point please correct me. But right now, I simply do not see anything substantial to corroborate the original statement.

If the CPU can't service spanning tree, then all **** will break loose anyway.

Yes, that is true, but note that this means that the STP/RSTP did not work correctly before the CPU became unable to service the STP/RSTP. You can't blame a dysfunctional protocol for not doing its work correctly. Once again, it is my firm belief that if STP/RSTP are deployed appropriately, a CPU overload resulting from a switching loop simply cannot occur, because the switching loop will not be created in the first place. Neither STP nor RSTP is prone to creating switching loops unless misconfigured or prevented from working properly by some unforeseen external means (such as other switches dropping BPDUs).

Best regards,

Peter

bbaillie · 04-10-2012

Hi,

Peter makes some valid points.

I would also like to point out that if you are relying on BPDUGUARD to save the day, the logic does not fit. Here is why: BPDUGUARD engages upon receipt of a BPDU, so if some devices filter BPDUs, the BPDUs will never arrive at the port. Therefore, if BPDUs are filtered, BPDUGUARD does nothing and you still have a loop. The correct feature is LOOPGUARD. You need to enable SPANNINGTREE LOOPGUARD, then enable ERRDISABLE DETECT for this item and ERRDISABLE RECOVERY for this item. Now you will detect the loop and disable the port; five minutes later the port will attempt to recover and get shut down again, a canary in the coal mine if you follow. Or don't recover, your choice, but not a good choice.

The difference between PVST and RAPID PVST is only the beginning state when a port is first brought up. PVST will immediately assume the posture of negotiating spanning tree and, if all looks good, begin forwarding; of course a topology change will be generated by this port and disrupt your network for a short while. RAPID PVST begins by going to the forwarding state and if it receives a new BPDU then it negotiates spanning tree by blocking, listening and learning. Of course, after the successful negotiation there is a topology change, but not as severe as with regular PVST.

The CPU issues are only a big issue if you have a very large spanning tree domain and this particular switch has some ports it is currently blocking.

Rapid spanning tree has had some problems on specific IOS versions due to dropping BPDUs and causing sporadic topology changes. But overall use RAPID PVST it has more features than regular spanning tree which by the way was created over a weekend by a very smart person a very long time ago. RAPID PVST has far more features and protections against bridging loops and the spanningtree disruptions of regular spanning tree.

The extra features are on by default for RAPID PVST. LOOPGUARD and ERRDISABLE RECOVERY need to be turned on.

Cheers,

Brian

Peter Paluch · 04-10-2012

Hello Brian,

Thank you for your kind words!

I am afraid, though, that there are some statements in your response I can not agree with.

The correct feature is LOOPGUARD.

I do not believe this feature would help, either. LoopGuard reacts to the event that BPDUs have been arriving on a port and suddenly stop arriving, without any further topological change. Note that if the BPDUs never arrived, one of the requirements is not met, and LoopGuard will never jump into action. LoopGuard is a protection mechanism against unidirectional links or STP implementation bugs that cause BPDUs to be lost unidirectionally, but it is not going to prevent a loop if the BPDUs have not arrived from the very beginning.
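
To make the two trigger conditions concrete, here is a small Python sketch. It is a deliberately simplified model and not how IOS implements either feature: each list entry stands for one Hello interval on a port (True if a BPDU arrived), and `missed_limit` is an illustrative stand-in for the BPDU timeout. All names are hypothetical.

```python
def bpdu_guard_trips(bpdu_seen):
    """BPDU Guard acts as soon as any BPDU is received on the port."""
    return any(bpdu_seen)


def loop_guard_trips(bpdu_seen, missed_limit=3):
    """Loop Guard acts only if BPDUs were arriving and then stop.

    missed_limit is an assumed number of consecutive silent intervals
    after which the stored BPDU information is considered expired.
    """
    received_before = False
    missed = 0
    for seen in bpdu_seen:
        if seen:
            received_before = True
            missed = 0
        elif received_before:
            missed += 1
            if missed >= missed_limit:
                return True
    return False


# A neighbor that filters BPDUs from the start: nothing ever arrives.
filtered = [False] * 10
# A link that turns unidirectional: BPDUs arrive, then stop.
went_silent = [True, True, True, False, False, False]

print("filtered neighbor: bpdu_guard =", bpdu_guard_trips(filtered),
      "loop_guard =", loop_guard_trips(filtered))
print("BPDUs stop arriving: loop_guard =", loop_guard_trips(went_silent))
```

In this model, a neighbor that drops BPDUs from the very beginning trips neither guard, which is the case under discussion.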

There is no sound protection against a neighboring switch that drops BPDUs. With STP/RSTP supported only partially in the network, the protection is also only partial.

RAPID PVST begins by going to the forwarding state and if it receives a new BPDU then it negotiates spanning tree by blocking, listening and learning.

This would be true only for PortFast (i.e. edge) ports. However, by default, all ports on Cisco Catalyst switches are considered non-edge, and even with RSTP, they have to go through Discarding - Learning - Forwarding sequence of states (or make a rapid Discarding -> Forwarding transition if both switches negotiate that using the Proposal/Agreement mechanism). No STP version, be it legacy STP, RSTP or MSTP, allows itself the luxury (and naiveté) of putting a regular port into Forwarding state immediately without first making sure that there is no loop.
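
The three cases just described can be summarized in a short Python sketch. It is a simplification for illustration, not the RSTP state machine itself; the function and parameter names are hypothetical.

```python
def rstp_port_state_sequence(edge_port, agreement_received):
    """Return the states an RSTP port passes through when it comes up.

    edge_port: True if the port is configured as PortFast (edge).
    agreement_received: True if the Proposal/Agreement handshake with
    the neighboring switch completed.
    """
    if edge_port:
        # Edge ports are placed into Forwarding immediately.
        return ["Forwarding"]
    if agreement_received:
        # Rapid transition negotiated with the neighbor.
        return ["Discarding", "Forwarding"]
    # Timer-driven fallback, one forward_delay per step.
    return ["Discarding", "Learning", "Forwarding"]


print(rstp_port_state_sequence(edge_port=False, agreement_received=False))
print(rstp_port_state_sequence(edge_port=False, agreement_received=True))
print(rstp_port_state_sequence(edge_port=True, agreement_received=False))
```

Only the edge-port case skips Discarding, and a port is an edge port only when it has been configured that way.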

But overall use RAPID PVST it has more features than regular spanning tree which by the way was created over a weekend by a very smart person a very long time ago.

Oh, I believe Dr. Perlman invested more time into fleshing out the first STP than just a weekend, and it certainly took quite some time and a lot of thinking at IEEE until the 802.1D STP became standardized at all.

Best regards,

Peter

bbaillie · 04-11-2012

Hi Peter,

It's always good to acknowledge someone who gives good advice, then of course steal that advice and use it everywhere you can.

LOOPGUARD is their best hope if they choose to install or leave switches connected that behave so badly as to filter out BPDUs. Then the faint hope is the switch receives one of its own BPDUs back, triggers LOOPGUARD - ERRDISABLE and at least they have a breadcrumb to follow as to where and when the problem started.

The presumption that they have PORTFAST enabled because it is an access layer port is one I made, just to point out that RAPID PVST will at least react to new BPDUs received, whereas regular PVST does not react in any way.

Actually, Radia Perlman states that spanning tree came to her just before going to bed on Friday. She immediately wrote it down. Then she spent Monday and Tuesday writing it out in detail, and spent the rest of the week writing a poem to go with the idea. The IEEE actually did add more stuff to spanning tree that Radia did not like or agree with.

So in short, it took a weekend for the invention of it by Radia and months of work by the IEEE with a team of engineers to mess it up.

Cheers,

Brian

Peter Paluch · 04-11-2012

Hi Brian,

LOOPGUARD is their best hope if they choose to install or leave switches connected that behave so badly as to filter out BPDUs. Then the faint hope is the switch receives one of its own BPDUs back, triggers LOOPGUARD - ERRDISABLE and at least they have a breadcrumb to follow as to where and when the problem started.

But this is not the point of the LoopGuard at all. In Cisco's implementation, if a port receives back its own BPDU, it will be declared as Broken or Self-looped and put into Blocking (or Discarding) state. You do not need to activate the LoopGuard at all for this to happen. Please see this thread:

https://supportforums.cisco.com/message/3005900#3005900

I am showing the results of a port receiving back its own LOOP frame (note that this is not a LoopGuard protection!) and its own BPDU. Receiving our own LOOP frame back will result in err-disabling the port immediately, and this has nothing to do with STP at all. Receiving our own BPDU back will result in putting the port into Blocking (STP) or Discarding (RSTP) state. However, neither of these two mechanisms involves the LoopGuard.

Once again, the LoopGuard is a mechanism somewhat similar to UDLD that tries to prevent loops caused by unidirectional links when Root or Alternate ports stop receiving BPDUs and erroneously become Designated. It has nothing to do with a port receiving back its own LOOP frame or BPDU. If a port never received BPDUs, LoopGuard cannot act on it because the triggering condition - designated bridge's BPDUs ceasing to arrive without the port going down - will never happen.