Intermittent ARP/STP related problem

We have two Catalyst 3550 switches at the core level and several Catalyst 2924XL switches at the distribution level, with several VLANs trunked using 802.1Q encapsulation and HSRP for gateway redundancy.

Recently we have encountered an intermittent issue affecting certain servers on one of the VLANs. The symptoms:

- Intermittently, some of the servers on one of the VLANs will suddenly lose connectivity to the gateway.

- From the gateway (Catalyst 3550 layer-3 switch), I can see the servers' IP addresses in the ARP cache, but I can't ping the servers.

- After I clear the ARP cache on the gateway, I can ping the servers again. But after some time (around 10-15 minutes), the problem happens again.

- The strange thing is that I don't see any difference in the MAC address learned for the affected IP address in the ARP cache before and after I clear it. So it doesn't look like an ARP poisoning issue.

- The problem affects some servers, not all servers, on the specific VLAN, and it affects one server after another at different times. Meaning, if it is affecting one server now, another time it might affect a different server.

- The problem does not happen to all affected servers at the same time.

- The problem does not affect servers on other VLANs (other than this specific affected VLAN).

- We have tried migrating the subnets (there are several subnets on the VLAN) to a newly created VLAN, but the problem seems to follow them to the new VLAN. So it doesn't look like a VLAN-related problem.

- We noticed an STP topology change around the time the problem started to affect one of the servers. This is the message:

STP: VLAN0195 Topology Change rcvd on Fa0/5

Not too sure whether it's related?
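One way to judge whether these messages matter is to count how often they arrive and on which port. Below is a minimal Python sketch, assuming the switch log has been saved as text and that every message has the same form as the one quoted above. The function name and the Fa0/7 sample line are made up for illustration.

```python
import re
from collections import Counter

# Matches messages of the form quoted in the post, e.g.
# "STP: VLAN0195 Topology Change rcvd on Fa0/5"
TC_PATTERN = re.compile(r"STP: VLAN(\d+) Topology Change rcvd on (\S+)")


def count_topology_changes(log_lines):
    """Count topology-change messages per (VLAN number, port)."""
    counts = Counter()
    for line in log_lines:
        match = TC_PATTERN.search(line)
        if match:
            vlan = int(match.group(1))  # "0195" -> 195
            port = match.group(2)
            counts[(vlan, port)] += 1
    return counts


# Sample input: the first line is from the post, the Fa0/7 line is hypothetical.
sample = [
    "STP: VLAN0195 Topology Change rcvd on Fa0/5",
    "STP: VLAN0195 Topology Change rcvd on Fa0/5",
    "STP: VLAN0195 Topology Change rcvd on Fa0/7",
    "some unrelated log line",
]

for (vlan, port), n in count_topology_changes(sample).most_common():
    print(f"VLAN {vlan} {port}: {n} topology change(s)")
```

If one port accounts for most of the messages, the device or switch behind that port is the natural place to look first.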

The core is running Cisco Catalyst 3550 WS-C3550-24-SMI with the c3550-i5q3l2-mz.121-11.EA1.bin image. Do you think upgrading the IOS to the latest version (c3550-i5q3l2-mz.121-22.EA5.bin) will help?

Has anyone faced this strange problem before and can give me some advice? So far, as a temporary workaround, I have created a script that runs the "clear arp-cache" command on the gateway every 5 minutes, which is definitely not a permanent solution.

Any help will be greatly appreciated. Thanks in advance.

Re: Intermittent ARP/STP related problem


I'd investigate the topology changes.

The result could be a recalculation of the spanning tree, during which uplinks pass through the blocking -> listening -> learning states before they start forwarding.

Another issue is that the mac-address table has to be updated if an uplink changes.

Depending on your topology, you could enable UplinkFast on your access switches. With it, an uplink changes into the forwarding state immediately, and the network is flooded with the contents of the CAM table (so that the addresses are relearned within 15 seconds).

Re: Intermittent ARP/STP related problem


Thanks for your advice. I've investigated the STP topology change, but I couldn't find a relation to the problem. When the problem happens, it affects only some, not all, of the servers on the same VLAN that are connected to the same access switch.

Meaning, if we have servers A and B on the same VLAN and connected to the same access switch, when the problem happens on server A, server B is not affected at all. In fact, there are some servers on the same VLAN and connected to the same access switch which have never been affected.

So I'm not too sure whether the problem is related to the topology change. I'm also not sure what causes the topology change, since none of the uplinks had a problem at that time. Furthermore, the topology change message didn't occur every time we spotted the problem.

Spanning tree convergence should be fast if there's a topology change, but in this case, once a server has the problem, it is unpingable for at least 15-20 minutes, after which it comes back by itself, or earlier if I clear the ARP cache manually on the gateway.
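To put a number on "didn't occur every time", the outage start times can be lined up against the times of the topology change messages. This is a rough Python sketch, assuming both sets of timestamps have been collected by hand or from logs. All timestamps, the two-minute window, and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def topology_changes_before(outage_start, change_times, window=timedelta(minutes=2)):
    """Return the topology changes logged within `window` before an outage began."""
    return [
        t for t in change_times
        if timedelta(0) <= outage_start - t <= window
    ]


# Hypothetical timestamps, for illustration only.
change_times = [
    datetime(2005, 1, 10, 9, 58, 30),
    datetime(2005, 1, 10, 13, 15, 0),
]
outage_starts = [
    datetime(2005, 1, 10, 10, 0, 0),   # 90 seconds after the first change
    datetime(2005, 1, 10, 11, 30, 0),  # no change logged shortly before
]

for start in outage_starts:
    nearby = topology_changes_before(start, change_times)
    print(start, "preceded by", len(nearby), "topology change(s)")
```

If most outages have no topology change shortly before them, the two are probably unrelated. If most do, the topology changes deserve a closer look even though not every server is hit.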

What else could be causing the problem? Could it be an ARP-related attack from one of the servers?
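One way to check the ARP-attack theory is to save the gateway's ARP table just before and just after a clear and compare the two snapshots by script rather than by eye. The Python sketch below assumes the usual "show ip arp" column layout (Protocol, Address, Age, Hardware Addr, Type, Interface). The addresses, MACs, and function names are invented for illustration.

```python
def parse_arp_table(text):
    """Map IP address -> MAC address from 'show ip arp' style output."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with "Internet"; the header row and blanks are skipped.
        if len(fields) >= 4 and fields[0] == "Internet":
            entries[fields[1]] = fields[3].lower()
    return entries


def changed_entries(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC differs between snapshots."""
    return {
        ip: (before[ip], after[ip])
        for ip in before.keys() & after.keys()
        if before[ip] != after[ip]
    }


# Hypothetical snapshots; real ones would be pasted from the gateway.
BEFORE = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.0.2.10              12  0000.5e00.5301  ARPA   Vlan195
Internet  192.0.2.11               3  0000.5e00.5302  ARPA   Vlan195
"""
AFTER = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.0.2.10               0  0000.5e00.5301  ARPA   Vlan195
Internet  192.0.2.11               0  0000.5e00.53ff  ARPA   Vlan195
"""

changes = changed_entries(parse_arp_table(BEFORE), parse_arp_table(AFTER))
for ip, (old, new) in sorted(changes.items()):
    print(f"{ip}: {old} -> {new}")
```

An empty result across several incidents would agree with the earlier observation that the MAC address does not change, and would point away from ARP poisoning and toward the layer-2 forwarding path.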