HELP! Unexplainable network connectivity issues...

gvb · ‎03-12-2007

Architecture is as follows:

- 3560G-48TS as the core (L3 routing, VTP server, etc...), running IP Base. Very basic configs, Vlan1 for management, another VLAN for about 25 IP Phones, another VLAN with about 50 desktops/servers.

- All other switches are uplinked to this core 3560. Another 3560G-48TS, a pair of 3560POE-48S, a 3560POE-24, a 2970 and a 2950.

I recently set up an NMS box running OpenNMS, Cacti, and Rancid to do some basic management functions for the network and noticed a couple of things.

Cacti shows a few of the switches (mostly the core) as unavailable via SNMP, with VERY high latency. I ran an extended ping and dropped about 0.8% of packets. Note also that both Cacti and OpenNMS show drops not only to switches, but to servers as well. This correlates with another NMS box running a Dell IT Assistant SNMP client, which shows the same connectivity issues (servers "bounce" up and down all day).

All switches are sitting at around 5-10% CPU utilization, and this network should be sleeping.

Right now, the NMS box is connected via an "edge" switch. I will connect it directly to the core to eliminate the possibility of an uplink problem on the switch the box is currently connected to, but I am also losing connectivity to servers connected to other switches.

Any ideas?

kev.whelan · ‎03-12-2007

Do you have a duplex mismatch anywhere on the system ?

Are you taking logs back to a syslog server ?

gvb · ‎03-12-2007

As far as I know, there are no duplex/speed mismatch issues anywhere. No syslog, but I have local buffer debug logging enabled and don't see anything strange. I have all traps going to the OpenNMS system and don't see anything there either.

johnnylingo · ‎06-18-2007

I've been running into something similar with our 3560s here. No dropped packets, but the ping times to the switch's management interface are 200+ ms on average. Ping times to the end workstations and phones are just fine. We haven't noticed any problems from the user side. IOS versions are 12.2(25)SEE2 and 12.2(25)SEE3 w/ IP Services feature set.

I haven't been able to pin it down exactly, but I believe it is somehow Spanning Tree related. Shutting down redundant uplinks seems to fix it, and converting to Rapid PVST helps as well. Also, the installer had used portfast on trunk links, and I'm wondering if that may have something to do with it.

johnnylingo · ‎06-28-2007

In my case, it was the ports being configured as trunks that was the root cause. Since spanning tree was running for all VLANs (we have 150) on all 48 ports, the CPU was about 20% on average. Once I converted the ports to access, CPU load dropped to 5% and ping times were consistently under 1ms.
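To put rough numbers on that, here is a back-of-the-envelope sketch of how many logical spanning-tree port instances the switch has to maintain in each case. It assumes per-VLAN spanning tree with every VLAN active on every trunk port (as described above) and one VLAN per access port; the function name is made up for illustration, and this is only arithmetic, not something read from the switch.

```python
def stp_port_instances(ports: int, vlans_per_port: int) -> int:
    """Logical STP port instances: one per (port, VLAN) pair
    when spanning tree runs separately for each VLAN."""
    return ports * vlans_per_port


# 48 trunk ports, each carrying all 150 VLANs (assumption: nothing is
# restricted on the trunks' allowed-VLAN lists).
trunk_load = stp_port_instances(ports=48, vlans_per_port=150)

# The same 48 ports as access ports, one VLAN each (assumption: no
# additional voice VLAN on the port).
access_load = stp_port_instances(ports=48, vlans_per_port=1)

print(trunk_load, access_load)  # 7200 48
```

Going from 7200 to 48 instances fits the drop in CPU load I saw, although I haven't measured the relationship directly.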

avmabe · ‎06-19-2007

For one, don't trust the cacti ping latency. I have seen differences in versions of cacti where some work/calculate correctly and others do not.

My current version is 0.8.6i and reports ping times incorrectly. I do not use ping within cacti for /anything/.

Not sure why they don't respond to SNMP. When you walk the device to generate graphs, does it work?

Rule of thumb also: don't trust pings to switches or routers. Your best method of testing latency is /through/ a switch or router to an end host. For this, I use Smokeping, which generates great graphs for latency/loss/jitter.
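If you just want a quick sanity check before setting up Smokeping, something like the sketch below can summarize a batch of round-trip times you collected by pinging an end host through the switch. This is not how Smokeping works internally; the function name, the input format (RTT in milliseconds, None for a lost probe), and the jitter definition (mean absolute difference between consecutive received replies) are all my own assumptions.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results: None marks a lost probe,
    a float is the round-trip time in milliseconds."""
    received = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    # Jitter here: mean absolute change between consecutive received
    # replies (lost probes are skipped, not counted as a gap).
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "sent": sent,
        "loss_pct": loss_pct,
        "median_ms": median(received) if received else None,
        "jitter_ms": sum(diffs) / len(diffs) if diffs else 0.0,
    }


# Made-up sample: five probes to an end host, one of them lost.
samples = [1.0, 2.0, None, 4.0, 2.0]
summary = summarize_probes(samples)
print(summary["loss_pct"], summary["median_ms"])  # 20.0 2.0
```

Feed it samples taken to a host behind the switch rather than to the switch's own management address, for the reason given above.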

Also, just for good measure, make sure you set the core switch to STP root bridge for each vlan, if you have not done so already.

I'd be interested to hear more because I have an environment that is the same as yours.

dave.keith · ‎06-19-2007

This is excellent advice from avmabe. It bears repeating: "don't trust pings to switches and routers"!

We use a small free utility called Q-Check, which was from NetIQ but is now distributed by Ixia:

http://www.ixiacom.com/products/performance_applications/pa_display.php?skey=qcheck

Another great little free app is GETIF, which is a simple manual SNMP get'r/set'r. This could give you a second opinion about the SNMP on the switches and routers.

One question : Are there problems being seen / experienced by the users or are you just seeing these problems when looking at NMS data ? I ask because it can be very easy to make up your own phantom problems when you get new tools that give you a different view.

Dave