Re: Intermittent loss of connectivity to management interface on switches

ICowan · ‎06-05-2006

Hi,

I have a very strange issue that has been occurring on my network for about 4-5 months.

Our NMS (Solarwinds) reports various switches down in our network. There is no pattern to when the switches lose and then regain connectivity. I use a rather large management subnet for switches (10.99.0.0/16). There are about 120 switches in this subnet. User connectivity is never affected while the NMS is unable to ping the management interfaces of the switches.

This is affecting about 6-8 of my switches. They are 2924M-XL and 2924-XLs running 12.0(5)WC13.

I'm still at a loss as to why this is occurring, but here are some symptoms I have noticed.

1) Memory usage on the switches in question jumped from about 70% to 82% as reported by our NMS. I don't know why this occurred.

2) Our core 6513's ARP cache has an entry for the switch's IP address but pinging does not work.

3) Ping to the default gateway address of the 10.99.0.0 network (10.99.0.201) on an affected switch is always 80% successful.

Here are the results of the arp cache and pings:

COBCore1#show arp | include 10.99.211.1

Internet 10.99.211.1 111 0030.9435.1c00 ARPA Vlan99

COBCore1#clear arp

COBCore1#show arp | include 10.99.211.1

COBCore1#ping 10.99.211.1

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.99.211.1, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

COBCore1#show arp | include 10.99.211.1

Internet 10.99.211.1 0 0030.9435.1c00 ARPA Vlan99

COBCore1#ping 10.99.211.1

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.99.211.1, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

COBCore1#

COBCore1#ping 10.99.211.1

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.99.211.1, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

COBCore1#

WEnforcementSw2#ping 10.99.0.201

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.99.0.201, timeout is 2 seconds:

!!.!!

Success rate is 80 percent (4/5), round-trip min/avg/max = 1/5/11 ms

WEnforcementSw2#ping 10.99.0.201

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.99.0.201, timeout is 2 seconds:

!!.!!

Success rate is 80 percent (4/5), round-trip min/avg/max = 1/3/6 ms

WEnforcementSw2#ping 10.99.0.99

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.99.0.99, timeout is 2 seconds:

!.!!!

Success rate is 80 percent (4/5), round-trip min/avg/max = 1/3/6 ms

WEnforcementSw2#ping 128.1.210.25

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 128.1.210.25, timeout is 2 seconds:

!!!.!

Success rate is 80 percent (4/5), round-trip min/avg/max = 1/7/16 ms

WEnforcementSw2#

I would appreciate any thoughts as to why this is occurring and suggestions of things to try in order to fix it.

If there is any additional info you need in order to help please ask.

Thanks, Ian.

vladrac-ccna · ‎06-05-2006

What do you see on the switch side?

Can you log into it (console maybe)?

Vlad

IHCowan · ‎06-05-2006

Nothing unusual.

I have TELNETed to some of the switches when the management interface is up and available and things look good. Since they are all at remote sites I have not had a look when the management interface was not available however.

Ian.

glen.grant · ‎06-05-2006

The one that is dropping packets looks like a bad link in the path between your NMS and the switch; check for a bad link or a speed/duplex mismatch. As for the first one, you have some kind of routing problem somewhere, because it doesn't have a path. If you do a trace to the one that doesn't ping, where does it stop? I would start looking there. Also check for things like duplicate addresses on the switches, and make sure they are all using the same netmask.
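The duplicate-address and netmask check suggested above could be scripted against an exported inventory. This is only a rough sketch: the switch names, addresses and masks below are made up for illustration, and only the 10.99.0.0/16 management subnet comes from the original post.

```python
import ipaddress
from collections import Counter

# Management subnet as described in the original post.
MGMT_NET = ipaddress.ip_network("10.99.0.0/16")

# Hypothetical inventory: (switch name, management IP, configured mask).
# These rows are invented; SwitchB/SwitchC share an address and
# SwitchD carries the wrong mask, to show what a finding looks like.
inventory = [
    ("SwitchA", "10.99.211.1", "255.255.0.0"),
    ("SwitchB", "10.99.212.1", "255.255.0.0"),
    ("SwitchC", "10.99.212.1", "255.255.0.0"),
    ("SwitchD", "10.99.213.1", "255.255.255.0"),
]

def audit(inventory, network):
    """Return (switch, problem) pairs for bad subnets, masks and duplicates."""
    problems = []
    counts = Counter(ip for _, ip, _ in inventory)
    for name, ip, mask in inventory:
        iface = ipaddress.ip_interface(f"{ip}/{mask}")
        if iface.ip not in network:
            problems.append((name, f"{ip} is outside {network}"))
        elif iface.network != network:
            problems.append((name, f"mask {mask} gives {iface.network}, expected {network}"))
        if counts[ip] > 1:
            problems.append((name, f"{ip} is assigned to {counts[ip]} devices"))
    return problems

for name, problem in audit(inventory, MGMT_NET):
    print(f"{name}: {problem}")
```

With 120 or so switches in one /16, a check like this is quicker than eyeballing each config, but it only catches what is in the inventory, not a rogue device using one of the addresses.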

IHCowan · ‎06-05-2006

Everything checks out. The pings drop from the same subnet, so I know it's not a routing issue, and I've already checked the mask. Don't forget, user traffic (on a different VLAN) is never affected.

Thanks, Ian.

Roberto Salazar · ‎06-05-2006

I'm still at a loss as to why this is occurring, but here are some symptoms I have noticed.

1) Memory usage on the switches in question jumped from about 70% to 82% as reported by our NMS. I don't know why this occurred.

>> Get a console connection and capture show proc cpu to see which process is hogging the CPU, if it's really pegged at ~82%.

2) Our core 6513's ARP cache has an entry for the switch's IP address but pinging does not work.

>> Is the core 6513 the default gateway for the switch having issues? What VLAN? What does clearing the ARP entry do for you while the issue is happening?

3) Ping to the default gateway address of the 10.99.0.0 network (10.99.0.201) on an affected switch is always 80% successful.

>> Which device is the default gateway? Are the hosts connected to the affected XLs in the same VLAN as the management interface? If so, what is the result if you ping directly from a host to the XL's management interface? If this ping test does not show any drops, start looking at the link between the XL and the default gateway, or the link going to the gateway. Check the interfaces on both sides for any drops. The main point is to check whether the drop is happening within the XL, meaning when pinging from a directly connected host to the switch's management interface.
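When comparing ping tests from different points, it can help to record which probe in each run was lost rather than just the percentage. A lost first probe is commonly just ARP resolution, while losses in the middle of a run (as in the outputs earlier in this thread) suggest something else. A rough sketch, where parse_ping is a made-up helper name and the sample text is copied from the original post:

```python
import re

def parse_ping(output):
    """Return (received, sent, positions of lost probes) from IOS ping output."""
    rate = re.search(r"Success rate is \d+ percent \((\d+)/(\d+)\)", output)
    # The result line consists only of '!' (reply) and '.' (timeout).
    marks = re.search(r"^[!.]+$", output, re.MULTILINE)
    if not rate or not marks:
        raise ValueError("not a recognisable ping result")
    lost = [i + 1 for i, ch in enumerate(marks.group()) if ch == "."]
    return int(rate.group(1)), int(rate.group(2)), lost

# Output as posted earlier in the thread (ping from the switch to 10.99.0.201).
sample = """Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.0.201, timeout is 2 seconds:
!!.!!
Success rate is 80 percent (4/5), round-trip min/avg/max = 1/5/11 ms"""

received, sent, lost = parse_ping(sample)
print(f"{received}/{sent} replies, lost probe(s): {lost}")
```

Running the same parser over captures taken from a directly connected host and from the gateway makes it easier to see whether the drop pattern follows the switch or the path.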

Hope that would at least point you to the right track.

Please rate helpful posts.

IHCowan · ‎06-05-2006

1) As luck would have it, one of the switches I've been having issues with just came back. Here is a modified (0% removed) "show proc cpu". Unfortunately I don't know what is normal, so I don't know if any of this is a problem. Does anything stand out to your eyes?

SandalwoodYard#show proc cpu

CPU utilization for five seconds: 42%/12%; one minute: 40%; five minutes: 41%

PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process

3 11381154 1356286 8391 0.00% 0.17% 0.12% 0 Check heaps

9 2502532 4209375 594 0.00% 0.02% 0.00% 0 ARP Input

10 0 1 0 0.00% 0.00% 0.00% 0 RAM Access (dm 0

11 0 1 0 0.00% 0.00% 0.00% 0 Critical Bkgnd

12 1016903 6776457 150 0.00% 0.00% 0.00% 0 Net Background

13 6398 3950 1619 0.00% 0.00% 0.00% 0 Logger

14 3277160 9277200 353 0.10% 0.01% 0.00% 0 TTY Background

15 53416683 27798491 1921 0.30% 0.39% 0.41% 0 Per-Second Jobs

16 49668 319336 155 0.00% 0.00% 0.00% 0 Net Input

17 96518 1915688 50 0.00% 0.00% 0.00% 0 Compute load avg

18 4238003 159605 26553 0.00% 0.04% 0.00% 0 Per-minute Jobs

19 431214092 601695996 716 1.94% 1.87% 1.97% 0 LED Control Proc

20 0 1 0 0.00% 0.00% 0.00% 0 Module Managemen

21 1947679216 1074516966 1812 14.35% 13.91% 13.22% 0 Port Status Proc

PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process

22 1050 4098 256 0.00% 0.00% 0.00% 0 VM Prune Events

24 118810333 54230800 2190 0.02% 0.76% 0.72% 0 GDS Frame Ager

25 0 1 0 0.00% 0.00% 0.00% 0 RAM Access (dm 1

26 4158 551 7546 0.00% 0.18% 0.31% 1 Virtual Exec

27 0 1 0 0.00% 0.00% 0.00% 0 Frank Global Mas

30 21 9 2333 0.00% 0.00% 0.00% 0 Module Managemen

32 370858513 333447273 1112 3.17% 2.05% 2.01% 0 Broadcast Storm

38 429713457 46706743 9200 2.07% 2.81% 2.78% 0 Enet Aging

39 1202570 2352515 511 0.00% 0.01% 0.00% 0 IP Input

40 1794043 50614 35445 0.00% 0.01% 0.00% 0 Address Deletion

42 887286 1181408 751 0.00% 0.01% 0.00% 0 CDP Protocol

57 1217988 8079645 150 0.10% 0.01% 0.00% 0 Spanning Tree

58 18973632 162382605 116 0.20% 0.21% 0.20% 0 STP Hello

59 74219045 14112489 5259 0.71% 0.70% 0.71% 0 STP Queue Handle

60 86723 4004 21659 0.00% 0.00% 0.00% 0 Malibu STP Adjus

61 0 1 0 0.00% 0.00% 0.00% 0 Time Range Proce

63 0 2 0 0.00% 0.00% 0.00% 0 Router Autoconf

64 5004 360 13900 0.00% 0.00% 0.00% 0 SNMP ConfCopyPro

65 0 1 0 0.00% 0.00% 0.00% 0 Bridge MIB traps

68 0 2 0 0.00% 0.00% 0.00% 0 VTP Malibu Trap

69 2909357 1916456 1518 0.00% 0.02% 0.00% 0 Runtime diags

70 0 1 0 0.00% 0.00% 0.00% 0 SNMP Timers

71 39334744 2726165 14428 0.00% 0.38% 0.37% 0 IP SNMP

72 0 1 0 0.00% 0.00% 0.00% 0 SNMP Traps

73 667249 9751597 68 0.00% 0.00% 0.00% 0 NTP

SandalwoodYard#

2) The 6513 is the default gateway for all subnets/VLANs. Hosts use a different VLAN and are not affected. A host connected to the 2924M-XL that pings it would trunk to the 6513 (where the host's default gateway is) and then back to the 2924M-XL.

I agree that in order to know for sure where the problem is occurring I'll have to see where the ICMP request is being dropped.
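For anyone else staring at a "show proc cpu" capture like the one in item 1 without knowing what is normal, ranking the rows by the 5Min column at least shows where the time goes. A rough sketch only: top_processes is a made-up helper, the rows are copied from the output above, and the pattern deliberately ignores the Runtime/Invoked/uSecs columns because they can run together when the counters get wide.

```python
import re

# PID, then anything up to the three percentage columns, then TTY and name.
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+.*?\s(?P<s5>[\d.]+)%\s+(?P<m1>[\d.]+)%\s+(?P<m5>[\d.]+)%"
    r"\s+(?P<tty>\d+)\s+(?P<name>.+?)\s*$"
)

def top_processes(output, n=3):
    """Return the n busiest rows as (5-minute %, PID, process name)."""
    rows = []
    for line in output.splitlines():
        m = ROW.match(line)
        if m:  # header and summary lines do not match and are skipped
            rows.append((float(m["m5"]), int(m["pid"]), m["name"]))
    return sorted(rows, reverse=True)[:n]

# Rows copied from the SandalwoodYard capture above.
capture = """PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
19 431214092 601695996 716 1.94% 1.87% 1.97% 0 LED Control Proc
21 19476792161074516966 1812 14.35% 13.91% 13.22% 0 Port Status Proc
32 370858513 333447273 1112 3.17% 2.05% 2.01% 0 Broadcast Storm
38 429713457 46706743 9200 2.07% 2.81% 2.78% 0 Enet Aging
59 74219045 14112489 5259 0.71% 0.70% 0.71% 0 STP Queue Handle"""

for pct, pid, name in top_processes(capture):
    print(f"{pct:6.2f}%  PID {pid:<3} {name}")
```

This only sorts what the switch reported; whether a given process share is abnormal still needs a baseline from a healthy switch of the same model.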

Thanks for your help, Ian.

Roberto Salazar · ‎06-05-2006

42% is about normal for this platform. Here is a switch in a lab with no traffic:

2900xl#sh proc cpu

CPU utilization for five seconds: 37%/6%; one minute: 32%; five minutes: 31%

As you can see it's 37%, that's about the normal range.

If the 6513 is the gateway for the hosts and they have no performance issue at all and their path toward the 6513 is the same trunk link you use for that ping test where you see a drop, I would think it's not the link but more of a vlan specific issue. Just to completely rule out that link, can you maybe connect a host directly to that xl and put that port in the same vlan as the mgmt interface? See if you see the same amount of drop. Also, check the following:

show int for that trunk to the 6513. Look for any errors, drops, etc.

show spanning-tree vlan X >> verify that the root is what it should be.

If the CPU is low and traffic through the switch is fine, the packets have to be getting dropped somewhere. That is what you need to find: where are they getting dropped?
