We had a new network installed around March. To save money, we used Catalyst Express 500s. We set up Cisco phones as well. There are 7 buildings, all connected with fiber. The data was rock solid, no issues at all, but the phones were another matter: very unstable. So we upgraded to 3560s. Now we are having the opposite problem: the phones are rock solid, but the data is not very dependable. In particular, we have two servers that are connected to different switches, and the users continue to get disconnected from the switches. I have tried hard coding the speed and duplex, and replaced NICs and cables. Things go great for a week, then trouble starts building slowly, until it gets to the point where people are dropped every 10 or 15 minutes.
Could really use some help!
If you have any dual NIC servers that are connected to 2 switches, make sure the 2 NICs are set for fault tolerance rather than load balanced teaming. If you need the load balancing, the 2 NICs should be connected to 2 switch ports configured for EtherChannel.
When users are being dropped do pings to the servers fail? Are the users on the same VLAN as the servers? Is it possible that the servers are overloaded?
No, no dual NIC servers.
Pings to the servers might fail for just one or two passes. Using ping -t in Windows, most get replies, but one or two get dropped, usually when the server is in "lost land".
Most of my users' PCs are hanging off of Cisco phones, so I would not know how to look at that for a setup.
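Since the drops come and go, it may help to capture the ping -t output to a file and count the losses afterwards instead of watching the window. Below is a minimal sketch in Python, assuming the standard English Windows ping wording ("Reply from ..." and "Request timed out."). The function name summarize_ping and the address 192.168.1.10 are made up for illustration.

```python
import re

# A successful reply line from Windows ping, for example:
# "Reply from 192.168.1.10: bytes=32 time<1ms TTL=128"
REPLY = re.compile(r"^Reply from \S+ bytes=\d+ time[=<]\d+ms TTL=\d+")


def summarize_ping(lines):
    """Return (sent, lost, bursts) for captured `ping -t` output.

    bursts is a list of (first_probe_index, length), one entry per run of
    consecutive losses, so single drops and longer outages can be told apart.
    """
    sent = 0
    lost = 0
    bursts = []
    run_start = None  # probe index where the current loss run began
    for raw in lines:
        line = raw.strip()
        if REPLY.match(line):
            ok = True
        elif (line == "Request timed out."
              or line.startswith("Reply from")  # e.g. "Destination host unreachable."
              or "General failure" in line):
            ok = False
        else:
            continue  # header, blank line, or statistics: not a probe
        if ok:
            if run_start is not None:
                bursts.append((run_start, sent - run_start))
                run_start = None
        else:
            lost += 1
            if run_start is None:
                run_start = sent
        sent += 1
    if run_start is not None:  # capture ended in the middle of a loss run
        bursts.append((run_start, sent - run_start))
    return sent, lost, bursts


# Hypothetical capture; 192.168.1.10 stands in for the server address.
sample = [
    "Pinging 192.168.1.10 with 32 bytes of data:",
    "Reply from 192.168.1.10: bytes=32 time<1ms TTL=128",
    "Request timed out.",
    "Request timed out.",
    "Reply from 192.168.1.10: bytes=32 time=1ms TTL=128",
    "Request timed out.",
]
print(summarize_ping(sample))
```

Running the same capture from a PC behind a phone and from a PC plugged straight into the switch would show whether the loss bursts line up in time.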
Both servers have about 5 users each, and CPU utilization barely creeps past 5 percent.
This usually "builds". Things will be great for a week, then things start to slow down, then maybe 1 or 2 crashes, then just crashes all over the place. (Caveat: by crash, I mean users lose connectivity to the data on the server, so it causes their client app to crash. The server does not crash.)
No. Interface counters are clean. Occasionally, an interface reset. Not sure why. Again, I have hard coded speed and duplex on both the server and the switch port.
Are the users' PCs on the same subnet as the servers? I am wondering if there is a router involved, or are the 3560s doing layer 3? Have you tried pings from the default gateway router to the servers? If the traffic is traversing multiple switches, you should check the links between switches for errors as well.
Is it a spanning tree issue? Can you provide the configurations of the switches and routers, and the "sh ip route" output from the routers? Is any QoS enabled?
Here is the "sh ip route" on the 3560 that has a server and users on the same switch:
Admin1-3560#sho ip route
Default gateway is not set
Host Gateway Last Use Total Uses Interface
ICMP redirect cache is empty
Strangely enough, we have a 3560G where 4 fiber links from different buildings come in, and "sho ip route" is exactly the same.
But here's what's truly weird. We can go a week with NO issues at all, and all of a sudden, clients are dropping connections from the server every 15 minutes. VERY frustrating.
Yes. Both users' PCs and servers are on the same subnet. I have not tried pings from the def gw to the server. We have gone down so many roads with this issue that we have added an entry in the HOSTS file on the client PCs, with the server's IP address, in the hope that the clients won't need to resolve the name.
In one case, both the server and clients are on the SAME switch, and they get bounced. So, dumb question, how would I know if the 3560 is doing layer 3?
If the client and server are on the same switch and the same subnet, then there is no router or layer 3 switching involved. The switch is forwarding the Ethernet frames based on the learned MAC addresses, and the traffic stays on the switch. If pings are being dropped between the two, then it is either cables, NICs, or the switch. You might try different switch ports. If it still fails, you need to replace the switch.
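To show what "forwarding based on learned MAC addresses" means, here is a toy model in Python. It is only a sketch of the general learn/flood/forward idea, not how a 3560 is implemented; the class name, port numbers, and short MAC strings are invented for the example.

```python
class LearningSwitch:
    """Toy layer 2 switch: learn source MACs, forward or flood by destination."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood out of every port except the arrival port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination is on the arrival port: nothing to forward.
            return []
        return [out_port]


sw = LearningSwitch(ports=[1, 2, 3])
print(sw.receive(1, "aa", "bb"))  # "bb" not learned yet, so the frame is flooded
print(sw.receive(2, "bb", "aa"))  # "aa" was learned on port 1
print(sw.receive(1, "aa", "bb"))  # "bb" is now known on port 2
```

The point for this thread: once both MACs are learned, client-to-server traffic on one switch never touches a router, so "sh ip route" tells you nothing about it. If the same MAC keeps showing up on different ports, the table entry keeps moving and frames go to the wrong place, which is one way a layer 2 loop shows up.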
We have lots of 3560s and I have yet to see one fail or even be flaky, so I would be surprised if more than one switch is bad. What is your cabling like? Is there more than just a patch cable between the switch and the PC/server?
Spanning tree failures or misconfiguration might be another possibility (as someone else mentioned).
This can only occur if you have redundant links, so you could prove this is not the case by only connecting a single link between switches. Of course it would be better to look at your spanning tree config, but it sounds like you may not be prepared for that.
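One way to check on paper whether you have redundant links is to list every switch-to-switch link and see whether any of them closes a loop. A rough sketch in Python follows, using a union-find pass over the link list. The switch names are placeholders, not your actual topology, and this only covers links you know about (a patch cable looped between two wall jacks would not be in the list).

```python
def redundant_links(links):
    """Return the links that close a loop, given (switch_a, switch_b) pairs.

    An empty result means the listed links form a loop-free tree, so
    spanning tree would have nothing to block among them.
    """
    parent = {}

    def find(node):
        # Find the representative of the group this switch belongs to.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path as we go
            node = parent[node]
        return node

    extra = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            extra.append((a, b))  # both ends already connected: this link is redundant
        else:
            parent[root_a] = root_b
    return extra


# Hypothetical layout: a core switch with fiber to two buildings,
# plus a direct link between the two buildings.
links = [("core", "bldg1"), ("core", "bldg2"), ("bldg1", "bldg2")]
print(redundant_links(links))
```

If the list comes back empty for your real links and the problem persists, spanning tree on the uplinks is a less likely suspect.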
You are correct in that this is all a big "flat" layer 2 setup. Sorry I was a bit frazzled and didn't comprehend your question!
All cabling is Cat5 minimum, some Cat5e and a little Cat6.
If you tell me what to look at for my spanning tree config, I will give that a look. Like I said, we had CE500s, and not one problem for six weeks. We popped in the 3560s, and almost immediately we started having these errors with the server.
The one that really irks me is the one you are referring to: server and clients on the same switch. I do a "show tech-support", and the switch is not busy; CPU usage is very small.
Tell me what to look for in the spanning tree stuff, as I am getting pretty desperate! I am also going to put a sniffer on the client and server to capture traffic and see what is going on.
Start isolating the issue. Are the hosts and servers in the same subnet/VLAN? If not, when you start seeing the issue, are the hosts able to ping their default gateway with no drops at all? Then take the ping a little further by pinging the default gateway of the server from the host. You did mention you are getting drops on your ping test; the key is finding where the drops are occurring. In addition, are the hosts directly connected to the switch having the same issue? I assume you have voice and data in different VLANs, but are they? See if the problem is isolated to the routed path from host to server, and see if the problem is isolated to hosts connected through a phone.
Hope that helps point you in the right direction.
Wow, lot of questions (thanks!) so I will try to hit them all.
Hosts and servers are all in Vlan1.
I have been trying to ping the server during a crash, not the def gw, but will try. Here's a little tidbit: User1 streams a radio station to his PC. When the app crashes (appears to lose connectivity to the server), there are no drops in the radio stream and no stutters; it stays perfect.
The server and hosts have the same def gw. (Servers and clients are all in 192.168.1.x and in Vlan1).
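A quick way to double-check the "same subnet" claim for any pair of addresses is Python's ipaddress module. This sketch assumes a 255.255.255.0 (/24) mask, which the 192.168.1.x notation suggests but which was not stated; the specific host addresses are made up.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len=24):
    """True if both addresses fall in the same network for the given prefix length."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


# Hypothetical client and server addresses in 192.168.1.x.
print(same_subnet("192.168.1.20", "192.168.1.200"))
# A host that drifted into another range would need a router to reach the server.
print(same_subnet("192.168.1.20", "192.168.2.20"))
```

If any client turned out to have a different mask or range than the server, its traffic would go via the default gateway rather than straight across the switch.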
I do have one situation where clients and the server are on the same switch, and we are getting this issue pretty routinely.
Voice is Vlan10, Data Vlan1.
No, users with PCs directly connected as well as users connected through the phone both crash in this app.