I've been troubleshooting a weird issue recently. I have 2 x 3750G-12S as a distribution layer providing routing for a layer 2 access layer. The distribution switches hold the SVIs and run HSRP between them. They have a cross-connect between them, and each has 3 x physical links that form an L2 port-channel down to a stack of 6 x 3750G-24PS-E access switches. The 3 x physical links on distribution 1 land on switches 1, 3 & 5 of the access layer stack, and the 3 x physical links on distribution 2 land on switches 2, 4 & 6. The spanning tree root and the HSRP active device is distribution node 1. The WAN links sit behind the distribution nodes.
I had a call from a client telling me that sometimes he can access his server, which hangs off one of the access switches, from a certain IP at another site but not from a different IP (when he should be able to from both). Then, the next day, he'll be able to access the server from the IP that failed the day before and won't be able to access it from the one that worked previously. Interestingly, on the machines that don't work on a particular day, the traceroute only gets as far as the inbound distribution interface and then just displays * * * for a timeout.
After some investigation, it seems that distribution switch 1 is the problem. If I change the spanning-tree root to be distribution 2, all is well. This is because, with distribution 2 as root, the uplink from the access switches to distribution switch 1 is in the blocking state and passing no traffic. This led me to look into the port-channel on distribution 1. No errors could be seen, so I tried removing the ports from the channel and bringing each physical link up one by one to see if there were errors on just one of the ports. To do this I changed the root of the spanning tree back to distribution switch 1, but I still ran into the same issue on each separate physical link. Changing the root back to distribution 2 makes everything work again. After this, I rebooted distribution 1 and still had exactly the same issue. The only other thing I can now think of is perhaps a faulty ASIC on distribution 1, as the links to the distribution box all use the same chip. But would this cause random traffic to work one day and not the next?
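For reference, this is roughly the kind of thing involved in the root swap and the checks described above. It is only a sketch: VLAN 10 and Port-channel 1 are placeholders, not the real IDs on this network, and the exact output will vary by IOS release.

```
! On the distribution switch that should become root (VLAN 10 is a placeholder)
configure terminal
 spanning-tree vlan 10 root primary
end

! Verify which switch is root and which ports are forwarding or blocking
show spanning-tree vlan 10
show spanning-tree root

! Check the port-channel and its member links (Port-channel 1 is a placeholder)
show etherchannel summary
show interfaces port-channel 1

! Confirm which distribution switch is HSRP active
show standby brief
```

Running the show commands on both distribution switches and on the access stack before and after the root change should make it clear which uplink is forwarding in each case.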
Can anyone think of anything else that this might be?