I have two 6500 switches that connect to two load balancers in active/standby mode. When the load balancers fail over, the ARP cache for one particular VLAN on the switches does not update, and will not update until I manually clear the ARP entry. All other VLANs update their MAC addresses. Any idea what could be causing this?
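By "manually clearing the ARP" I mean something like this on the 6500 (the VLAN number and IP are just examples, not my actual config):

```
Switch# show ip arp vlan 100       ! view ARP entries learned on that VLAN
Switch# clear ip arp 192.0.2.10    ! clear the stale entry for the affected IP
```

After the clear, the entry re-learns with the correct MAC immediately.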
These are decent boxes; I use them all the time with no issues at all.
Are you sure you have the HA set up correctly on the boxes? There are a few ways in which the F5s can fail over, some of which can cause issues in certain situations. Which have you selected, and how are you testing the failover?
While I was waiting for information on the HA setup, I read further about spanning tree on the interfaces to the F5s causing this type of issue.
Have you ever failed over a BigIP or run configsync on your BigIP cluster, and found that some of your VIPs are no longer reachable or even pingable once the failover operation has completed, only to have it all suddenly working again after 4 hours? Well, it is not your fault (unless you also manage the switches on your network. In that case...*cough* *cough*).
I'm not usually one to go into detail on lower-level protocol info, because the slightest mistake would result in a litany of comments about how the entire article must be wrong because I used the term "sub net" instead of "subnet" or something similar. So I will try to keep this one short and sweet.
The symptoms of the problem have already been described: fail over or sync the configuration with the standby unit (which also has the interesting effect of causing a fast failover/failback), and when all is said and done, a minority of your VIPs are no longer reachable. They cannot be pinged. Scratch your heads for a while and, after 4 hours, all of a sudden those VIPs are back, working again as if nothing happened. If you view the ARP cache on your primary switch (assuming all failover events have completed), you will see the MAC address of the standby unit advertised for those IPs that are no longer pingable. Since the default is to flush the ARP cache every 4 hours, if you do nothing, the affected VIPs will become pingable again after 4 hours.
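That 4-hour figure lines up with the default ARP aging time on Cisco IOS, which you can confirm per SVI (the VLAN number here is just an example); the output should show something like "ARP Timeout 04:00:00":

```
Switch# show interfaces Vlan100 | include ARP
```

Shortening the timeout with `arp timeout` under the SVI would only shrink the outage window, not remove it, so it isn't a real fix either.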
This is because you have spanning tree enabled on the switch ports that your BigIPs connect to. Assuming you are running a Cisco network, enabling portfast (or maybe even porthost) on the switch ports used by your VIPs should prevent the problem from ever happening again.
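On IOS that would look roughly like this (the interface number and description are examples):

```
interface GigabitEthernet1/1
 description Link to BigIP external interface
 spanning-tree portfast
```

Portfast skips the listening/learning states, so the port starts forwarding immediately after a link event instead of blocking while the gratuitous ARPs from the new active unit are being sent.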
If you need to wait five years for someone in your network engineering group to believe that it really is a switch problem (as I did) and you need an interim recovery solution, go into the "Virtual Address List" section under "Virtual Servers" in the BigIP Web Administration tool on the unit that is currently primary. Then:
1. Select the failing IP address from the list.
2. Deselect the "Enabled" checkbox in the ARP section.
3. Select the "Update" button.
4. Re-select the "Enabled" checkbox in the ARP section.
5. Select "Update" again.
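If you prefer the command line, I believe the same toggle can be done with tmsh on newer BigIP versions (the address below is a placeholder for your failing virtual address):

```
tmsh modify ltm virtual-address 192.0.2.10 arp disabled
tmsh modify ltm virtual-address 192.0.2.10 arp enabled
```

Disabling and re-enabling ARP on the virtual address is what triggers the fresh gratuitous ARP, same as the GUI steps above.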
This will re-ARP the MAC address of your primary unit and overwrite what is in the ARP cache on the switch. Never forget, though, that this is a bandage; enabling portfast on those switch ports is the real fix.
This article is primarily focused on the BigIP as the victim, but it could really be any load balancer, and the symptoms themselves don't really change: some network event occurs, some of your IPs are no longer reachable, and 4 hours later everything's great. Spanning tree enabled on the PC/server port is the cause.
Any truth to this? Because I don't see spanning tree enabled on the interfaces.
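For reference, this is how I checked (the interface number is just an example from my side):

```
Switch# show spanning-tree interface GigabitEthernet1/1 portfast
Switch# show spanning-tree interface GigabitEthernet1/1 detail
```

Neither shows portfast configured, but the ports don't appear to be blocking either.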
Highly possible, but that sounds more like a bug in the F5 code than a Cisco issue, as it sounds like the standby F5 sends an ARP (possibly gratuitous) for the IPs of the VIPs, which is clearly wrong.
I have never felt that particular pain, as I have always had the interfaces facing the F5s set with portfast and, if they were trunked interfaces, I have used the portfast trunk command. But I have no idea how spanning tree could cause this; I will investigate further :-s.
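For reference, the setup I use looks roughly like this (interface numbers are examples, not from this thread):

```
! Access port facing an F5 self IP
interface GigabitEthernet1/1
 switchport mode access
 spanning-tree portfast

! Trunked port facing an F5 carrying multiple VLANs
interface GigabitEthernet1/2
 switchport mode trunk
 spanning-tree portfast trunk
```

The `portfast trunk` variant is needed because plain `spanning-tree portfast` only takes effect on access ports.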
The F5s can do spanning tree, but I always have it turned off.