Re: Occasional packet loss on LAN

chrismgeary · ‎11-10-2005

I have two 3560G-48 Dist switches and 8 2950-48-EI access switches in two groups of 4 uplinked with a fibre link to each 3560. I have RPVST running.

Every now and then, my monitoring server will show an access switch as not responding to ICMP. This only ever happens after I have made a port VLAN change, but it doesn't happen after every change.

Also, I am monitoring the HSRP address of the VLAN I am on, and out of 4400 pings sent every 2 seconds, 4 failed. This doesn't sound like much, but I thought I would easily be able to show 100% uptime for the LAN rather than the 99.91% it currently shows. There are a few buffer misses but none attributable to memory failure.
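For reference, the uptime figure quoted here is simply the share of echoes that received a reply. A minimal Python sketch of that arithmetic follows; the function name `reachability_pct` is hypothetical and only for illustration.

```python
def reachability_pct(sent: int, lost: int) -> float:
    """Percentage of echo requests that received a reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= lost <= sent:
        raise ValueError("lost must be between 0 and sent")
    return 100.0 * (sent - lost) / sent


# Figures from this post: 4400 pings sent, 4 failed.
print(f"{reachability_pct(4400, 4):.2f}%")  # prints 99.91%
```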

Is missing the odd ICMP request anything to be worried about? I ask because the other week users complained of loss of connectivity for about 1 minute. However, I was on the native VLAN at the time and experienced no such loss. I checked the logs and there were no STP topology changes, no faults in syslog, no high CPU usage, no memory issues. Nothing that might explain it.

Any suggestions?

glen.grant · ‎11-10-2005

The problem could be anywhere in the path from your monitoring server to whatever you are pinging, so you would have to check all the connecting links between the monitoring server and the switch. You're not losing many packets by your description, but if everything is working correctly you shouldn't lose any.

chrismgeary · ‎11-16-2005

I've set up a continuous ping using freeping.

I have it running on my workstation sited on a user VLAN. I've set it up to send an echo to 4 access switches, the default gateway and a server on the server VLAN every 2 seconds. Freeping considers 6 seconds without a response to be a failure.
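For anyone wanting to reproduce this kind of test without freeping, here is a rough Python sketch of a comparable loop: one echo per target per round, with a 6-second wait before an echo counts as failed. This is not how freeping itself works internally. The names `ping_once` and `monitor` are hypothetical, and the `ping` flags assume Linux iputils (`-c` for count, `-W` for the reply timeout in seconds); other platforms use different flags.

```python
import subprocess
import time


def ping_once(host, timeout_s=6):
    """Send one ICMP echo via the system ping; True if a reply came back.

    Assumes Linux iputils flags: -c 1 (one echo), -W (reply timeout, seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(hosts, rounds, interval_s=2.0, probe=ping_once):
    """Probe every host once per round; return {host: (sent, lost)}."""
    lost = {host: 0 for host in hosts}
    for i in range(rounds):
        for host in hosts:
            if not probe(host):
                lost[host] += 1
        if i < rounds - 1:
            time.sleep(interval_s)  # pause between rounds, not after the last
    return {host: (rounds, lost[host]) for host in hosts}
```

Note that this sketch probes the targets one after another, so a timeout on one target delays the echoes to the others in that round. The `probe` argument is there so the loop can be exercised with a stand-in function instead of real ICMP.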

My workstation is attached to one of the access switches. The access switches are in a group of 4 all linked together with fibre and uplinked to the distribution switches (3560s), also with fibre.

I also have an instance running on my monitoring server currently located on the native VLAN using the same parameters. It is sending 2 second echoes to the same 4 access switches, its own default gateway, the same server on the server VLAN and my workstation.

The monitoring server is attached directly to the distribution switches.

The results are, after approx 25000 echoes (3 days or so):

From my workstation:

All devices have missed echoes, at least 5 but no more than 13. This gives a theoretical reachability value of between 99.95% and 99.98%.

From my monitoring server:

The default gateway, one of the access switches and the server VLAN server all achieved 100%, no packet loss whatsoever. Three of the access switches and my workstation missed between 2 and 6 packets, or 99.97% to 99.99%.

What I conclude from this is that, aside from the access switches themselves missing the odd packet, packet loss seems to be confined to the access layer or the fibre link between it and the distribution layer. All inter-VLAN routing is done on the 3560 dist switches, and I've proven that there is no loss between VLANs.

What might cause packet loss between the access and distribution layer?

many thanks

Chris

chrismgeary · ‎11-16-2005

Further to this, I have noticed ignored packets on the 2950 gi uplink interface connecting to the distribution 3560. There are no ignored packets between 2950s nor on the 3560 gi interface.

GigabitEthernet0/1 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is
  Description:
  MTU 1500 bytes, BW 1000000 Kbit, DLY 1000 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, link type is auto, media type is SX
  output flow-control is off, input flow-control is off
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:00, output 00:00:04, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 607000 bits/sec, 241 packets/sec
  5 minute output rate 606000 bits/sec, 217 packets/sec
     331647176 packets input, 4140142772 bytes, 0 no buffer
     Received 237875743 broadcasts (0 multicast)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 6447 ignored
     0 watchdog, 230394481 multicast, 0 pause input
     0 input packets with dribble condition detected
     3745524066 packets output, 754867989 bytes, 0 underruns
     0 output errors, 0 collisions, 2 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
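To put the ignored count in proportion, the counters can be pulled out of the `show interface` text and compared with the total input packets. A small Python sketch follows; the helper name `counter` is hypothetical, and the regex assumes the counter wording shown in the output above.

```python
import re

# Two counter lines copied from the show interface output above.
SAMPLE = """\
331647176 packets input, 4140142772 bytes, 0 no buffer
0 input errors, 0 CRC, 0 frame, 0 overrun, 6447 ignored
"""


def counter(text, label):
    """Return the integer that precedes `label` in IOS counter output."""
    match = re.search(r"(\d+) " + re.escape(label), text)
    if match is None:
        raise ValueError(f"counter {label!r} not found")
    return int(match.group(1))


packets_in = counter(SAMPLE, "packets input")
ignored = counter(SAMPLE, "ignored")
ignored_pct = 100.0 * ignored / packets_in
print(f"{ignored} of {packets_in} input packets ignored ({ignored_pct:.4f}%)")
```

Because the counters on this interface have never been cleared, the ratio covers the whole time since the counters last reset and says nothing about when the ignored packets occurred.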

any help greatly appreciated.

Chris

chrismgeary · ‎11-17-2005

even further to this, I have the following data:

2950#sh controllers ethernet-controller gi0/1

  Transmit                          Receive
4286128344 Bytes                 1391302101 Bytes
 791586893 Frames                1582326414 Frames
   2337309 Multicast frames               0 FCS errors
   1499404 Broadcast frames       232221456 Multicast frames
         0 Pause frames             6703849 Broadcast frames
         0 Single defer frames            0 Control frames
         0 Multiple defer frames          0 Pause frames
         0 1 collision frames             0 Unknown opcode frames
         0 2-15 collisions                0 Alignment errors
         0 Late collisions                0 Length out of range
         0 Excessive collisions           0 Symbol error frames
         0 Total collisions               0 False carrier errors
         0 Control frames                 0 Valid frames, too small
         0 VLAN discard frames            0 Valid frames, too large
         0 Too old frames                 0 Invalid frames, too small
 777843660 Tagged frames                  0 Invalid frames, too large
         0 Aborted Tx frames          19205 Discarded frames

So the problem is definitely on the fibre link between the 3560 and the 2950. No links between 2950s show errors, nor does the redundant link to the other 3560. Both switch groups, I have two, show the same problems. The 3560 shows no errors.

Could there be some sort of compatibility issue between the 3560 and 2950? I have two different IOS versions on the 2950s in one group and the other.

*scratching my head*

Chris

chrismgeary · ‎11-30-2005

Since I cleared the interface counters, the ignored count has not increased, so this is probably a red herring.

my continuous pings show conclusively that:

from the monitoring server connected directly to a 3560 dist switch:

all 2950s in the access layer have dropped packets (36 lost out of 140,000)

there is a temporary 3550 running the standard image in the access layer. this has not dropped a single packet!

no packets within the distribution layer were dropped.

the temporary 3550 connects directly to the 3560 dist switch.

So this suggests that it's just the 2950s with the problem. Can anyone help me track this down further?

thanks!

Chris

glen.grant · ‎11-30-2005

A simple thing to check might also be to make sure all client ports are set up with portfast, so that TCNs aren't being sent every time someone logs on at a port. Can you correlate the drops with anything in the logs on the 2950s? We have a lot of 2950s around and we haven't noticed any problems with them. We don't have any 3560s, but we do have a few 3750s with no known problems. If you have access to the bug tool, you might want to take a look around to see if any of the bugs for 2950s might apply to your situation.

chrismgeary · ‎12-02-2005

All client ports are set to portfast. Nothing gets logged when packets get dropped.

Here is a weird thing. Previously we had two VLANs on our third floor. There wasn't really any need for them, but two of the access switches were in VLAN 2 and the other two were in VLAN 3.

Ever since I consolidated all the user ports into VLAN 3 (except for my workstation), my continuous ping has shown no packet loss from my workstation, some 29000 echoes later. However, the problem still appears to exist from the monitoring server, which pings the same devices as my workstation. Now I'm really confused! Perhaps I'm chasing a problem that isn't really a problem.