
Strange issue with new Cisco Catalyst 2960 (IOS)

Hello all,

I am upgrading an older 2950 (100M) switch, replacing it with a gigabit 2960. I installed it in the same rack; the configuration is practically non-existent, just the passwords and IP set. We run a single-VLAN flat network, so I started out by patching the new switch to the existing one. After a few days we had an opportunity to migrate during some downtime, so I disconnected the cables from the old switch and moved them to the new one. Everything seemed fine; there is connectivity and things operate, but a few days later we noticed that some network transfer activities are slow. There are no errors or log entries on either the new switch or the old one, but the low throughput is persistent.

All ports show 1G full duplex as they should, but when I test, traffic crossing a switch boundary looks distinctly asymmetric: normal read speeds and slow writes. Reversing the direction of the test hosts, I get slow reads and fast writes, so the problem seems to 'stick' to one side of the traffic path. Testing the same equipment against different targets without crossing a switch boundary does not show the problem. All intra-switch tests look good (gig switches transfer near a gig and 100M switches near 100M), but the moment there is a crossing things behave strangely regardless of the target (the new switch is the central backbone with most hosts, but it does no routing). The network layout is essentially a T with everything radiating from the new switch. I can eliminate the old switch soon, but I still need to resolve the problem with the crossing to the other switch.

Everything seems to point at the inter-switch links. One is a patch cable under two feet long, and the other is a dedicated fiber site-to-site link. We had the vendor confirm that the site link showed no issues, but seeing the same symptoms on both links makes me suspect something odd is happening on the switch.

I checked for duplex issues first but didn't find any. I flushed the ARP caches on all of the switches (3 total) and on all of the computers as well, but the problem persists.

Could this be an STP issue? If so, how can I set this switch as the STP root and force a refresh?

Any help would be greatly appreciated.


Hello Again,

OK, I still feel a bit stupid about that, but it did not resolve the problems. It did resolve the 'no link' issue when forcing the interface to 100/full. Since we saw these problems in two places, I checked everything between the switches and also had the ISP re-check their end. I found that one of the ISP feed cables had also been wired straight through instead of crossed, which was preventing us from configuring that part of the link as 100/full. I corrected that, and with all of the cable sillies out of the way, I coordinated with them to get both ends of the link locked down at 100/full duplex early Sunday morning.

I was very hopeful that would fix the problems completely, but it did not. The links are now at 100/full; I reset the counters and repeated the testing process, getting the same results. It seems a bit less extreme than before, so I suspect this contributed to the problem, but there is still a speed imbalance on the link between the sites. The configurations of all involved systems have been locked down as suggested (all linking ports fixed at speed and duplex, no cdp enable, switchport host on all access ports, IP gateway set, spanning-tree priority zero on VLAN 1 on the new switch, and spanning-tree portfast bpduguard default on the switch). The imbalance basically means we show about 90-97M on one side and 23-32M on the other, with a lot more fluctuation in speed as well (the numbers are lower under load but with the same imbalance).
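For reference, the lockdown described above corresponds roughly to the following IOS commands (interface numbers are illustrative and the gateway address is a placeholder, not our actual values):

! uplink toward the far site
interface GigabitEthernet1/0/48
 speed 100
 duplex full
 no cdp enable
!
! access ports
interface range GigabitEthernet1/0/1 - 40
 switchport host
!
ip default-gateway <gateway-ip>
spanning-tree vlan 1 priority 0
spanning-tree portfast bpduguard default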

Since the old switch was retired on Sunday as well, there is now only a single link with this problem: the fiber link between our two sites. When tested primary to secondary, the result is slow write throughput; in the other direction it shows up as slow read throughput. The 'remote' (secondary) site shows a very small number of drops and some invalid small packets, but the primary site still shows a much larger number of output drops (although its show controllers stats are now clean and free of collisions).

Since I was already causing short interruptions I decided to try shifting the link back to the older switch to see if it cleared up our output drops. Sadly it made them even worse.

Any suggestions for where to look next ?

Dave

PS: Had another thought. Could this simply be buffer overrun? This link is 100M and is being fed by a 1G switch with 1G hosts, and the output queue is small (in the interface info I saw 40 output buffers max). Since the incoming feed is fast, couldn't the traffic overflow the output buffer chain and cause output drops? I figured this was one of the reasons for configuring the interface at 100M: to make sure the switch queuing mechanisms know about the speed difference for buffering. Is there something else that needs to be configured for this kind of speed reduction to operate cleanly without drops? The line is nowhere near saturation (less than 10% utilization) according to the ISP's monitoring gear, but we have a persistent performance issue.

Hello

Can you run a couple of show commands?

sh int xxx | in O|I

sh int xxx switching

sh processes cpu sorted | in PID|Ouput|Input

res

Paul


Hi Paul,

Sorry for the delay; I was out of the office today and didn't have access. Show command output is below.

Last night I read some worrying information on a network engineering site claiming that the ASIC output buffers on this line of switches are inadequate by design and often cause exactly this kind of problem, creating near-constant oversubscription issues on any downspeed or heavily loaded links. They basically trashed the whole range, saying there was no way to really resolve the issue because it is an ASIC design limitation hard-coded into the device itself. I certainly hope that is not true. Having purchased several of these, I can tell you I would not be a happy camper.

While I realize that traffic to and from the faster hosts will create microbursts that exceed 100M, output buffering so weak that we can't even use 40% of the bandwidth because of drops (meaning traffic doesn't even reach the media before failing) truly fits my description of bad hardware design. Hell, even most cheap off-brand no-name switches handle downspeed links well enough to get better than 40% utilization out of a 100M link.
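Rough numbers, assuming the 0/40 output queue really does amount to about 40 full-size frames of egress buffering (which may not reflect the shared ASIC buffer pool):

  40 frames x 1518 bytes   ~ 0.49 Mbit of buffer
  1 Gb/s in - 100 Mb/s out = 900 Mb/s excess during a burst
  0.49 Mbit / 900 Mb/s     ~ 0.5 ms to fill the queue

If that assumption holds, a gigabit sender only has to burst for about half a millisecond before the queue overflows and drops begin.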

Output (the second line is the 40 buffer item I saw that worried me):

CISCO-2960-48-GB-ASP#sh int Gi1/0/47 | in O|I

  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 682329

  Output queue: 0/40 (size/max)

CISCO-2960-48-GB-ASP#sh int Gi1/0/47 switching

GigabitEthernet1/0/47

          Throttle count          0

        Drops         RP          0         SP          0

  SPD Flushes       Fast          0        SSE          0

  SPD Aggress       Fast          0

SPD Priority     Inputs          0      Drops          0

     Protocol       Path    Pkts In   Chars In   Pkts Out  Chars Out

        Other    Process          0          0    1228888   80619660

            Cache misses          0

                    Fast          0          0          0          0

               Auton/SSE          0          0          0          0

Spanning Tree    Process          0          0     220052   14081180

            Cache misses          0

                    Fast          0          0          0          0

               Auton/SSE          0          0          0          0

          CDP    Process          0          0     124944   59973096

            Cache misses          0

                    Fast          0          0          0          0

               Auton/SSE          0          0          0          0

          VTP    Process          0          0       1998     197802

            Cache misses          0

                    Fast          0          0          0          0

               Auton/SSE          0          0          0          0

          DTP    Process      20337    1220220          0          0

            Cache misses          0

                    Fast          0          0          0          0

               Auton/SSE          0          0          0          0

CISCO-2960-48-GB-ASP#sh processes cpu sorted | in PID|Ouput|Input

PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process

179     2042364  15702845        130  0.09%  0.06%  0.02%   0 IP Input

  50          45       735         61  0.00%  0.00%  0.00%   0 Net Input

  10      197808   1150945        171  0.00%  0.00%  0.00%   0 ARP Input

238           0         1          0  0.00%  0.00%  0.00%   0 RARP Input

287           0         2          0  0.00%  0.00%  0.00%   0 CSRT RAPID TRANS

Dave

Hello all,

After struggling with this for a while longer, I opened a TAC case for it and was basically told that this is normal behaviour with oversubscription (blah blah blah). I have continued to insist that the egress buffering on this switch is severely inadequate and essentially makes it unusable in a mixed-speed migration environment. No real solution has been proffered by TAC yet, but I would definitely suggest avoiding these switches in any mixed-speed or oversubscribed environment. Rather than reaching line saturation (or anywhere close to it), the egress buffers start to overflow very quickly (after just a few seconds), yielding well under 50% utilization of the downspeed (or oversubscribed) link.

While it is only a personal opinion, I believe any device that cannot achieve 80-85% utilization in such an environment should not advertise itself as multispeed compatible. This device has such inadequate egress buffering that it cannot dampen even a short-term burst; it starts to flail within seconds and never truly recovers. Although I've seen this with no-name switches, with all of the queuing logic and backplane capacity Cisco goes on and on about, I never expected to see it on this class of equipment.

more at:

http://www.gossamer-threads.com/lists/cisco/nsp/173830

I am wide open to suggestions here. For the moment I have connected an older C2950T between the new gigabit switch and the link, moving the inter-switch link to gigabit so that only the 'old' switch deals with the 100M link. That was enough to dampen the effect, so I can get 90+% utilization of the 100M link rather than the 15-40% I got when attaching the link directly to this switch, due to its lack of adequate buffering.

Lovely. A circa-2005 switch is required to repair the impaired functionality after installing a newer model.

Singularly unimpressive.

Dave

GigabitEthernet1/0/48 is up, line protocol is up (connected)

  Hardware is Gigabit Ethernet, address is f41f.c2dc.9bb0 (bia f41f.c2dc.9bb0)

  MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,

  reliability 255/255, txload 2/255, rxload 10/255

  Encapsulation ARPA, loopback not set

  Keepalive set (10 sec)

  Full-duplex, 100Mb/s, media type is 10/100/1000BaseTX

   input flow-control is off, output flow-control is unsupported

  ARP type: ARPA, ARP Timeout 04:00:00

  Last input 00:00:01, output 00:00:02, output hang never

  Last clearing of "show interface" counters never

  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 12712290

This means that whatever downstream device is connected to this port is sending a pause request back to the switch. The switch then puts the continuous stream of packets into a bucket or buffer until the bucket gets full and the counter increments. The reason you are observing slow responses is the resends. The downstream device simply can't handle the continuous stream.

Set the speed using the interface command of "speed auto 10 100" instead of just "speed 100".

Hi Leo,

That is an interesting suggestion, but I cannot see how allowing it to auto negotiate only 10 or 100 would help here.

I am willing to try it, but I cannot do it until I have some downtime. Our users are already unhappy with this situation, so we are trying to avoid any changes for a while. I am also hoping to hear something definitive from TAC.

Dave

That is an interesting suggestion, but I cannot see how allowing it to auto negotiate only 10 or 100 would help here.

Sorry, I was confusing everyone.

Try to avoid using the command "speed 100".  This sets the speed to 100 Mbps but disables auto-negotiation.  It also disables auto MDI/MDI-X.

If you say "speed auto 10 100" then auto-negotiation is still functional and auto MDI/MDI-X stays enabled.

Going back to your initial issue, how about this: set the speed down further with "speed auto 10".  Clear the counters and observe whether they increment.  Make sure you have STP enabled if this is a Layer 3 device (like a server).
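A minimal sketch of the interface changes being discussed, using Gi1/0/48 from the output earlier in the thread (check that "speed auto 10 100" is accepted on your platform and IOS version):

interface GigabitEthernet1/0/48
 ! instead of hard-coding the speed:
 !   speed 100
 !   duplex full
 speed auto 10 100   ! auto-negotiation stays on, but only 10/100 are advertised
 duplex auto
 ! or, to test the lower-speed suggestion:
 !   speed auto 10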

Hi Paul,

The entire config is earlier in the thread. It is not 'enabled' in the config; it must be a default in this version of the 2960.

CISCO-2960-48-GB-ASP#show mls qos

QoS is disabled

QoS ip packet dscp rewrite is enabled

Not sure where to look; I was working through a tech note about output drops when I found that.

http://www.cisco.com/en/US/products/hw/switches/ps5023/products_tech_note09186a0080c097bb.shtml
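For reference, that tech note walks through enabling MLS QoS and reallocating the shared egress buffers toward whichever queue is taking the drops. A rough sketch of that kind of tuning (the buffer split, threshold values and queue number are only illustrative, and enabling mls qos changes the default queuing behaviour, so this belongs in a maintenance window):

mls qos
! first identify which queue is dropping:
!   show mls qos interface gigabitEthernet 1/0/48 statistics
mls qos queue-set output 1 buffers 15 25 40 20
mls qos queue-set output 1 threshold 3 3100 3100 100 3200
!
interface GigabitEthernet1/0/48
 queue-set 1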

Dave

Hello

Okay, this looks like it is referring to the output queues, and as you know you are showing drops on these - but QoS isn't enabled?

Maybe an IOS change is in order for this switch?

Let's perform the changes I suggested previously first, and then if you are still having issues we should go for an IOS change.

What IOS is this 2960 currently running - 12.2(55)SE7x?

res

Paul


Yikes..

The switch is fairly new and has a recent IOS:

WS-C2960S-48TD-L   12.2(55)SE7           C2960S-UNIVERSALK9-M

Compiled Mon 28-Jan-13 10:28 by prod_rel_team

c2960s-universalk9-mz.122-55.SE7.bin

(the version you mention is on the 'remote' switch)

Do you really think it could require an update/upgrade?

I will take care of the other settings today; I had only configured a couple of host ports and the priority yesterday before getting called away. I can't change the uplink ports until Sunday, but I can configure the access ports and enable bpduguard today.

Thanks again.

Hello

12.2(55)SE7 looks clean; however, that doesn't mean it is!

15.0.2-SE5 looks clean as well?

res

Paul

Kind Regards
Paul