06-05-2006 08:00 AM - edited 03-03-2019 03:30 AM
Hi,
I have a very strange issue that has been occurring on my network for about 4-5 months.
Our NMS (Solarwinds) reports various switches down in our network. There is no pattern to when the switches lose and then re-gain connectivity. I use a rather large management subnet for switches (10.99.0.0/16). There are about 120 switches in this subnet. Never is user connectivity affected by the NMS's inability to ping the management interfaces of the switches.
This is affecting about 6-8 of my switches. They are 2924M-XL and 2924-XLs running 12.0(5)WC13.
I'm still at a loss as to why this is occurring, but here are some symptoms I have noticed.
1) Memory usage on the switches in question jumped from about 70% to 82% as reported by our NMS. I don't know why this occurred.
2) Our core 6513's ARP cache has an entry for the switch's IP address but pinging does not work.
3) Ping to the default gateway address of the 10.99.0.0 network (10.99.0.201) on an affected switch is always 80% successful.
Here are the results of the arp cache and pings:
COBCore1#show arp | include 10.99.211.1
Internet 10.99.211.1 111 0030.9435.1c00 ARPA Vlan99
COBCore1#clear arp
COBCore1#show arp | include 10.99.211.1
COBCore1#ping 10.99.211.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.211.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
COBCore1#show arp | include 10.99.211.1
Internet 10.99.211.1 0 0030.9435.1c00 ARPA Vlan99
COBCore1#ping 10.99.211.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.211.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
COBCore1#
COBCore1#ping 10.99.211.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.211.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
COBCore1#
WEnforcementSw2#ping 10.99.0.201
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.0.201, timeout is 2 seconds:
!!.!!
Success rate is 80 percent (4/5), round-trip min/avg/max = 1/5/11 ms
WEnforcementSw2#ping 10.99.0.201
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.0.201, timeout is 2 seconds:
!!.!!
Success rate is 80 percent (4/5), round-trip min/avg/max = 1/3/6 ms
WEnforcementSw2#ping 10.99.0.99
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.99.0.99, timeout is 2 seconds:
!.!!!
Success rate is 80 percent (4/5), round-trip min/avg/max = 1/3/6 ms
WEnforcementSw2#ping 128.1.210.25
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 128.1.210.25, timeout is 2 seconds:
!!!.!
Success rate is 80 percent (4/5), round-trip min/avg/max = 1/7/16 ms
WEnforcementSw2#
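The ping results above can be summarized with a few lines of Python. This is only a rough sketch: the function name `summarize_ping` is made up for illustration, and it assumes the usual IOS convention that `!` marks a reply and `.` marks a timeout.

```python
def summarize_ping(marks: str) -> tuple[int, int, list[int]]:
    """Summarize an IOS ping result line such as '!!.!!'.

    '!' is a reply received, '.' is a timeout. Returns
    (received, sent, 1-based positions of the lost echoes).
    """
    marks = marks.strip()
    lost = [i for i, ch in enumerate(marks, start=1) if ch == "."]
    received = marks.count("!")
    return received, len(marks), lost


# Result lines copied from the WEnforcementSw2 ping output above.
runs = ["!!.!!", "!!.!!", "!.!!!", "!!!.!"]
for run in runs:
    received, sent, lost = summarize_ping(run)
    print(f"{run}: {received}/{sent} received, lost echo(es) at {lost}")
```

In these four runs exactly one echo in five is lost each time, at position 3, 3, 2 and 4, so it does not look like the familiar case where only the first echo is lost while ARP resolves.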
I would appreciate any thoughts on why this is occurring and suggestions of things to try in order to fix it.
If there is any additional info you need in order to help please ask.
Thanks, Ian.
06-05-2006 08:25 AM
What do you see on the switch side?
Can you log into it (console maybe)?
Vlad
06-05-2006 08:46 AM
Nothing unusual.
I have telnetted to some of the switches while the management interface was up and available, and things look good. Since they are all at remote sites, however, I have not been able to look at one while its management interface was unavailable.
Ian.
06-05-2006 08:50 AM
The one that is dropping packets looks like a bad link in the path between your NMS and the switch; check for a bad link or a speed/duplex mismatch. As for the first one, you have some kind of routing problem somewhere, because it doesn't have a path. If you do a trace to the one that doesn't ping, where does it stop? I would start looking there. Also check for things like duplicate addresses on the switches, and make sure they are all using the same netmask.
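The netmask check can be sketched in Python with the standard `ipaddress` module. This is just an illustration: `gateway_on_link` is a made-up helper name, the addresses are the ones posted in this thread, and the /24 mask in the second call is a hypothetical misconfiguration, not something seen on these switches.

```python
import ipaddress


def gateway_on_link(address: str, mask: str, gateway: str) -> bool:
    """Return True if the gateway falls inside the subnet implied by address/mask."""
    interface = ipaddress.ip_interface(f"{address}/{mask}")
    return ipaddress.ip_address(gateway) in interface.network


# Addresses from the thread; the /16 mask matches the stated 10.99.0.0/16 subnet.
print(gateway_on_link("10.99.211.1", "255.255.0.0", "10.99.0.201"))    # True
# Hypothetical misconfiguration: the same switch with a /24 mask.
print(gateway_on_link("10.99.211.1", "255.255.255.0", "10.99.0.201"))  # False
```

A switch whose mask puts the gateway outside its own subnet would be unable to reach it directly, which is why a mismatched mask is worth ruling out.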
06-05-2006 11:20 AM
Everything checks out. The pings drop from within the same subnet, so I know it's not a routing issue, and I've already checked the mask. Don't forget, user traffic (on a different VLAN) is never affected.
Thanks, Ian.
06-05-2006 09:48 AM
I'm still at a loss as to why this is occurring, but here are some symptoms I have noticed.
1) Memory usage on the switches in question jumped from about 70% to 82% as reported by our NMS. I don't know why this occurred.
>> Get a console connection and capture show proc cpu to see which process is hogging the CPU, if it's really pegged at ~82%.
2) Our core 6513's ARP cache has an entry for the switch's IP address but pinging does not work.
>> Is the core 6513 the default gateway for the switch having issues? What VLAN? What does clearing the ARP entry do for you while the issue is occurring?
3) Ping to the default gateway address of the 10.99.0.0 network (10.99.0.201) on an affected switch is always 80% successful.
>> Who is the default gateway? Are the hosts connected to the XLs having the issue in the same VLAN as the mgmt interface? If so, what is the result if you ping directly from a host to the XL's mgmt interface? If this ping test does not show any drops, start looking at the link between the XL and the default gateway, or the link going to the gateway. Check the interfaces on both sides for any drops. The main point is to check whether the drop is happening within the XL itself, that is, when pinging from a directly connected host to the switch mgmt interface.
Hope that at least points you down the right track.
Please rate helpful posts.
06-05-2006 11:32 AM
1) As luck would have it, one of the switches I've been having issues with just came back. Here is a modified (0% rows removed) "show proc cpu". Unfortunately I don't know what is normal, so I don't know if any of this is a problem. Does anything stand out to your eyes?
SandalwoodYard#show proc cpu
CPU utilization for five seconds: 42%/12%; one minute: 40%; five minutes: 41%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
3 11381154 1356286 8391 0.00% 0.17% 0.12% 0 Check heaps
9 2502532 4209375 594 0.00% 0.02% 0.00% 0 ARP Input
10 0 1 0 0.00% 0.00% 0.00% 0 RAM Access (dm 0
11 0 1 0 0.00% 0.00% 0.00% 0 Critical Bkgnd
12 1016903 6776457 150 0.00% 0.00% 0.00% 0 Net Background
13 6398 3950 1619 0.00% 0.00% 0.00% 0 Logger
14 3277160 9277200 353 0.10% 0.01% 0.00% 0 TTY Background
15 53416683 27798491 1921 0.30% 0.39% 0.41% 0 Per-Second Jobs
16 49668 319336 155 0.00% 0.00% 0.00% 0 Net Input
17 96518 1915688 50 0.00% 0.00% 0.00% 0 Compute load avg
18 4238003 159605 26553 0.00% 0.04% 0.00% 0 Per-minute Jobs
19 431214092 601695996 716 1.94% 1.87% 1.97% 0 LED Control Proc
20 0 1 0 0.00% 0.00% 0.00% 0 Module Managemen
21 1947679216 1074516966 1812 14.35% 13.91% 13.22% 0 Port Status Proc
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
22 1050 4098 256 0.00% 0.00% 0.00% 0 VM Prune Events
24 118810333 54230800 2190 0.02% 0.76% 0.72% 0 GDS Frame Ager
25 0 1 0 0.00% 0.00% 0.00% 0 RAM Access (dm 1
26 4158 551 7546 0.00% 0.18% 0.31% 1 Virtual Exec
27 0 1 0 0.00% 0.00% 0.00% 0 Frank Global Mas
30 21 9 2333 0.00% 0.00% 0.00% 0 Module Managemen
32 370858513 333447273 1112 3.17% 2.05% 2.01% 0 Broadcast Storm
38 429713457 46706743 9200 2.07% 2.81% 2.78% 0 Enet Aging
39 1202570 2352515 511 0.00% 0.01% 0.00% 0 IP Input
40 1794043 50614 35445 0.00% 0.01% 0.00% 0 Address Deletion
42 887286 1181408 751 0.00% 0.01% 0.00% 0 CDP Protocol
57 1217988 8079645 150 0.10% 0.01% 0.00% 0 Spanning Tree
58 18973632 162382605 116 0.20% 0.21% 0.20% 0 STP Hello
59 74219045 14112489 5259 0.71% 0.70% 0.71% 0 STP Queue Handle
60 86723 4004 21659 0.00% 0.00% 0.00% 0 Malibu STP Adjus
61 0 1 0 0.00% 0.00% 0.00% 0 Time Range Proce
63 0 2 0 0.00% 0.00% 0.00% 0 Router Autoconf
64 5004 360 13900 0.00% 0.00% 0.00% 0 SNMP ConfCopyPro
65 0 1 0 0.00% 0.00% 0.00% 0 Bridge MIB traps
68 0 2 0 0.00% 0.00% 0.00% 0 VTP Malibu Trap
69 2909357 1916456 1518 0.00% 0.02% 0.00% 0 Runtime diags
70 0 1 0 0.00% 0.00% 0.00% 0 SNMP Timers
71 39334744 2726165 14428 0.00% 0.38% 0.37% 0 IP SNMP
72 0 1 0 0.00% 0.00% 0.00% 0 SNMP Traps
73 667249 9751597 68 0.00% 0.00% 0.00% 0 NTP
SandalwoodYard#
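One way to sanity-check rows in a pasted "show proc cpu" is that uSecs is the average CPU time per invocation, so it should equal Runtime(ms) times 1000 divided by Invoked, truncated. The sketch below is only illustrative: `avg_usecs` and `split_fused` are made-up helper names, and the truncation is an assumption inferred from the rows in this output. The same relation can recover the column boundary when two large counters run together in a paste.

```python
def avg_usecs(runtime_ms: int, invoked: int) -> int:
    """Average CPU time per invocation in microseconds, truncated."""
    return runtime_ms * 1000 // invoked


def split_fused(digits: str, usecs: int) -> list[tuple[int, int]]:
    """Find (runtime_ms, invoked) splits of a fused digit string that match uSecs."""
    matches = []
    for i in range(1, len(digits)):
        left, right = digits[:i], digits[i:]
        if right.startswith("0"):
            continue  # a counter would not be printed with a leading zero
        if avg_usecs(int(left), int(right)) == usecs:
            matches.append((int(left), int(right)))
    return matches


# PID 19 (LED Control Proc): Runtime 431214092 ms, Invoked 601695996, uSecs 716.
print(avg_usecs(431214092, 601695996))
# Runtime and Invoked of PID 21 (Port Status Proc) joined into one string, uSecs 1812.
print(split_fused("19476792161074516966", 1812))
```

For PID 21 the only split consistent with uSecs 1812 is Runtime 1947679216 and Invoked 1074516966.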
2) The 6513 is the default gateway for all subnets/VLANs. Hosts use a different VLAN and are not affected. A host connected to the 2924M-XL that pings it would trunk to the 6513 (where the host's default gateway is) and then back to the 2924M-XL.
I agree that in order to know for sure where the problem is occurring, I'll have to see where the ICMP request is being dropped.
Thanks for your help, Ian.
06-05-2006 12:21 PM
42% is about normal for this platform. Here is a switch in a lab with no traffic:
2900xl#sh proc cpu
CPU utilization for five seconds: 37%/6%; one minute: 32%; five minutes: 31%
As you can see, it's at 37%, which is about the normal range.
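For comparing the two header lines, here is a small Python sketch that pulls the numbers apart. It assumes the usual reading of the IOS header, where the figure before the slash is total CPU over five seconds and the figure after it is the share spent at interrupt level; `parse_cpu_header` and the dictionary keys are made-up names for illustration.

```python
import re

HEADER = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_header(line: str) -> dict[str, int]:
    """Split the 'show proc cpu' header line into its fields."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError(f"not a CPU utilization header: {line!r}")
    total, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_sec_total": total,
        "five_sec_interrupt": interrupt,
        # Assumed reading: whatever is not interrupt-level is process-level.
        "five_sec_process": total - interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


# Header lines from SandalwoodYard and from the idle lab 2900XL.
for line in (
    "CPU utilization for five seconds: 42%/12%; one minute: 40%; five minutes: 41%",
    "CPU utilization for five seconds: 37%/6%; one minute: 32%; five minutes: 31%",
):
    print(parse_cpu_header(line))
```

Under that reading, the process-level share is about the same on both switches (30% versus 31%), and most of the difference is in the interrupt-level figure.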
If the 6513 is the gateway for the hosts, they have no performance issues at all, and their path toward the 6513 is the same trunk link used in the ping test where you see drops, I would think it's not the link but more of a VLAN-specific issue. Just to completely rule out that link, can you connect a host directly to that XL and put that port in the same VLAN as the mgmt interface? See if you get the same amount of drops. Also, check the following:
show int for that trunk to the 6513 >> look for any errors, drops, etc.
show spanning-tree vlan X >> verify that the root is what it should be.
If the CPU is low and traffic through the switch is fine, the packets have to be getting dropped somewhere. That is what you need to find: where they are getting dropped.