I really have a question regarding STP timers, but I need to give a little background on a particular issue we're experiencing right now. Please refer to the diagram I've attached.
We currently have clustered servers in two geographically distinct locations. We tie these together (L2-wise) via 15454s across our MAN. At our main location, the servers are attached to a SAN that is connected via a Fibre Channel link to the other location. Backups to the remote SAN are made in real time. The servers are set up to "heartbeat" every 10-15 secs...basically a ping to the other server. If a server misses a ping, it initiates a script that makes the remote server and SAN become active.
That said, our issue is that STP takes around 35s to recalculate and begin forwarding traffic, which means the servers' 15s pings are missed, and that causes a failover. The problem is that the script being used is not very graceful and does not deal with this well, nor does it have any handling built in for the primary coming back online...but that's another story for the server guys to fight.
Leaving aside having the server/application guys fix their servers (by upping the ping timer to a value consistent with STP), what kind of STP changes could bring our convergence times down without bringing the network to the point of instability? I believe all of our STP values are currently at default.
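For reference, the numbers in play can be worked out from the default 802.1D timers (hello 2s, forward delay 15s, max age 20s). The sketch below is only rough arithmetic, assuming every switch is still at those defaults. The function names are made up for illustration, and nothing here talks to a switch.

```python
# Default 802.1D timers in seconds (assumption: untouched on every switch).
HELLO = 2
FORWARD_DELAY = 15
MAX_AGE = 20


def convergence_seconds(forward_delay=FORWARD_DELAY, max_age=MAX_AGE,
                        direct_failure=True):
    """Rough time for a blocked port to start forwarding.

    Direct failure: the port moves straight through listening and
    learning, each lasting one forward delay.
    Indirect failure: the stored BPDU has to age out (max age) first.
    """
    listening_and_learning = 2 * forward_delay
    if direct_failure:
        return listening_and_learning
    return max_age + listening_and_learning


def missed_heartbeats(outage_seconds, heartbeat_interval):
    """Approximate count of heartbeats sent while traffic is blocked."""
    return outage_seconds // heartbeat_interval


if __name__ == "__main__":
    for direct in (True, False):
        outage = convergence_seconds(direct_failure=direct)
        kind = "direct" if direct else "indirect"
        print(kind, outage, "s, ~", missed_heartbeats(outage, 15),
              "heartbeats missed at 15s")
```

With defaults, a direct link failure costs about 30s and an indirect one up to about 50s, so a 10-15s heartbeat misses at least two pings either way. The ~35s observed would be consistent with the 30s case plus some detection time, though that is a guess.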
I'm not sure what is causing STP to reconverge so often.
But here is a suggestion that you might want to try out.
Configure spanning-tree portfast on the ports that are connected directly to the servers. That way, the ports connected to the servers go to forwarding immediately.
Let me know if this helps.
How are your forward delay timers configured for the VLANs to which the servers are connected? Did you try reducing the forward delay?
I'm new to the organization so I'm not sure what they were set up as. Can you direct me to how to find that out? i.e., which 'show' command.
Think I may have found what you're talking about:
sw-xxx#sho span root forward
These haven't been touched. Can you speak to what this timer does, and what effect it would have by reducing it in a production environment?
pthiagas, thanks for the suggestion but portfast is already configured on the ports these servers are connected to.
The issue is not so much the servers coming online, it's the time it takes for STP to converge when a change, whether planned or unplanned, has occurred.
Like I alluded to in my original post, the issue mainly lies with the servers not conforming to how the network is built. I'm simply trying to learn of any STP tweaks that could safely be put into place to help, if any.
As Pari asked, what is actually happening to make STP reconverge? In a generally stable network where you are connecting servers/clients etc., STP should not need to reconverge very often.
That said, although it is not clear from your Visio what type of switches you are running, you may well want to look into Rapid PVST+, which reduces the convergence time from approx. 50 seconds to less than 5 seconds, which would then fit in with your server guys' script. Not all switches support Rapid PVST+, and wherever a switch that doesn't support it is in the path, the ports facing it fall back to normal 802.1D behaviour and you lose the fast convergence there. Note that this happens on a per-VLAN basis.
The forward delay that Pari was referring to is one of the timers STP uses when it is calculating a loop-free topology. You can change it, but you really need to understand STP and your own network before you start making these sorts of changes. Attached are a couple of links for Rapid PVST.
Understanding Rapid PVST
Migration from PVST+ to Rapid PVST
Jon, you're correct...and this network normally does not need to recalculate very often at all. We've just had two incidents lately (completely separate) that have caused this issue to even surface, because it was never communicated to the network group how these pings were set up, or we would have set their expectations differently.
That said, I've heard of Rapid PVST+, but haven't looked much into it. I'm not sure, but I suspect that not all of our switches would support it. I'll check. But can it be turned on for only specific vlans and leave others on the existing STP version? I suspect not, but I'll read the links you provided.
Portfast is on the server ports already, and it seems that my issue would be with the recalculation on the blocked port to the root as opposed to the server port.
So, one more question. As I have it drawn, would I be correct in assuming that the switch that is active would only be one diameter away? Shouldn't I be able to just look at a network drawing and determine the diameter? Is there a command that will allow software to show you the diameter of the switch you're on?
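On the diameter question: the 802.1D timers are derived from the diameter and the hello time, and the relationship can be sketched as below. This assumes the commonly quoted formulas (max age = 4 x hello + 2 x diameter - 2, forward delay = (4 x hello + 3 x diameter) / 2) and that the forward delay is rounded up to whole seconds. The function name is hypothetical. Diameter here means the number of switches on the longest path between any two end stations, so it is something you count from the drawing.

```python
import math


def timers_for_diameter(diameter, hello=2):
    """Derive max age and forward delay (seconds) from the STP diameter.

    Assumption: diameter is limited to 2..7, the range normally
    recommended for 802.1D; rounding up is also an assumption.
    """
    if not 2 <= diameter <= 7:
        raise ValueError("diameter is expected to be between 2 and 7")
    max_age = 4 * hello + 2 * diameter - 2
    forward_delay = math.ceil((4 * hello + 3 * diameter) / 2)
    return {"hello": hello, "max_age": max_age, "forward_delay": forward_delay}


if __name__ == "__main__":
    for d in (2, 7):
        print(d, timers_for_diameter(d))
```

A diameter of 7 with a 2s hello reproduces the defaults (max age 20s, forward delay 15s), while a diameter of 2 gives max age 10s and forward delay 7s. On Cisco IOS, the diameter keyword of the spanning-tree vlan ... root primary command sets the timers from the diameter on the root bridge, which is generally safer than setting them by hand.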
As for whether it can be turned on for specific VLANs, it really depends on whether the VLAN you are turning it on for exists only on the switches capable of running Rapid PVST+.
So you would need to clear the VLAN off any trunk links that have non-Rapid PVST+ capable switches at the other end. Of course, this may not be so simple if the non-Rapid PVST+ switch has clients/servers attached in that VLAN.
Configuring Rapid PVST+ is really easy; you just need to check whether all your switches support it or not.
Will do, Jon. As it turns out, the server group is working on tailoring their pings to fit the STP timers as they currently are, but this conversation has been very useful. Thanks for the input and for pointing me in the right direction.
As the previous poster indicated, I would think this problem might be eliminated if you have spanning-tree portfast on the server switchports. If it's not on, then yes, you will wait 35-45 seconds for spanning tree to run; portfast eliminates this wait for attached access devices. It also eliminates topology changes for any port that uses portfast. That being said, it is not a good idea to start messing with the default spanning tree timers.