Strange performance issue

Mar 26th, 2008

One of our customers is currently experiencing a strange performance issue. The problem is that large file transfers are slow when a workstation (Win-XP) on Fast Ethernet retrieves the file from a server (Win2k3) that is connected on a gigabit interface. For example, a 130MB file can be uploaded in approximately 12-15s, while downloading it from the server takes about twice as long (31s).
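To put those timings in terms of link utilisation, here is a rough calculation. It is only a sketch and assumes 1 MB = 10^6 bytes and that the quoted times cover the whole transfer; the function name is made up for illustration.

```python
# Average throughput for the transfer times quoted in the post.
FILE_MB = 130  # file size from the post, assuming 1 MB = 10**6 bytes


def throughput_mbps(size_mb: float, seconds: float) -> float:
    """Return the average transfer rate in megabits per second."""
    return size_mb * 8 / seconds


for label, seconds in [("upload, 12 s", 12), ("upload, 15 s", 15), ("download, 31 s", 31)]:
    print(f"{label}: {throughput_mbps(FILE_MB, seconds):.1f} Mbit/s")
```

Under those assumptions the upload uses most of the 100 Mbit/s access port, while the download only reaches roughly a third of it.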

The customer has all his servers attached to a 3750-stack, the workstations are on 3560-switches and the core is a redundant 6500. The exact network topology does not seem to make much difference though; I am also able to reproduce it on the 3750 stack by forcing a port to 100Mb and attaching a workstation to it.

There are no duplex mismatches in the network and the MTU size is 1500 throughout. The gig-link on the server was not loaded either; we used a dedicated test server for our tests. Fiddling with the server's registry only produced worse results.

In the troubleshooting process, I already upgraded the IOS version of the 3750s from 12.2.25 to 12.2.40 but this made no difference. Also, the complaints started after we installed the 3750 stack.

To add to the weirdness of this, the customer reported that the problem was not present on workstations running Vista or Linux. That was where I stopped investigating a month ago, and I considered the issue a non-Cisco problem. Unfortunately we have just been informed that the problem is also absent when the switch is a 3Com instead of a 3750. My impression is that it might be related to packet buffering, but I find the issue very hard to troubleshoot.

I will go to the customer next week to make some traces but any input from anyone of you is highly appreciated.


Leo

graemeporter Wed, 03/26/2008 - 02:09

Might this be to do with QoS (Quality of Service / packet shaping)? It's a possibility if your customer uses IP Telephony.


If QoS is implemented incorrectly, then there may be less bandwidth available for data on certain ports, with the remainder being reserved for voice traffic.


Additionally, if there are two VLANs involved, the router may be configured to perform packet shaping in one direction. If the device that IS able to download quickly from the server happens to be on the same VLAN as the server, its packets might never hit the router, so no packet shaping is applied. The devices that take twice as long to download are not on the same VLAN as the server, so their packets must go through the router and may be subject to packet shaping there.


Basically, I don't know for sure - but it sounds like a plausible cause for this problem, in my head at least :)


Hope this is of some help to you - let me know how you get on.


Kind regards

Graeme

lgijssel Wed, 03/26/2008 - 03:02

I found it hard to capture everything that has already been done about this in the first message.

QoS is not enabled on the 3750 switches and still we can reproduce the issue there.

Packet shaping is not configured here either, nor in the rest of the network.

The servers are largely in vlan 16, the test-pc was in the citrix-vlan (2). Be aware that the Citrix servers (on gigabit links) do not have any performance problems.


This is the vlan config of the 3750, pretty basic stuff I think:

interface Vlan1
 description ILO & mgmt
 ip address 172.23.8.254 255.255.255.0
 ip helper-address 172.23.1.3
 ip helper-address 172.23.1.1
 ip helper-address 172.23.1.2
!
interface Vlan2
 description Citrix
 ip address 172.23.2.254 255.255.255.0
 ip helper-address 172.23.1.3
 ip helper-address 172.23.1.1
 ip helper-address 172.23.1.2
!
interface Vlan3
 description GIS
 ip address 172.23.3.254 255.255.255.0
!
interface Vlan4
 description SRV-PA
 ip address 172.23.4.254 255.255.255.0
!
interface Vlan16
 description SRV01
 ip address 172.23.1.254 255.255.255.0
!
interface Vlan104
 ip address 172.23.255.6 255.255.255.252
!
interface Vlan108
 ip address 172.23.255.10 255.255.255.252
!
interface Vlan112
 ip address 172.23.255.14 255.255.255.252
!
interface Vlan116
 ip address 172.23.255.18 255.255.255.252
!

VLANs 104 - 116 are the layer 3 uplinks to the core switches. Load balancing is performed by CEF, and currently we allow only two links to carry the traffic to/from the core. Using only one link did not give any improvement.


VoIP is not used here. There is some QoS in the network but this is merely to trust traffic from one particular port that is not yet in use. Still, I will test if it makes any difference when this is disabled. Keeping you posted!


regards,

Leo

royalblues Wed, 03/26/2008 - 04:03

Check for drops at the ASIC level, which could be one of the causes.


The commands "sh platform pm if-numbers" and "sh platform port-asic stats drop asic" will tell you this.


You might have to tune your interface queues based on this


HTH

Narayan


lgijssel Wed, 03/26/2008 - 04:38

On most ports of the 3750, all counters are zero. A few have higher numbers, but they do not increase when I view them a second (or third) time.

The figures were accumulated over a running period of 3 months, so they do not seem like large numbers to me.

This was the highest counter:

L34AM#sh platform port-asic stats drop asic 9

Port-asic Port Drop Statistics - Summary
========================================
RxQueue 0 Drop Stats: 0
RxQueue 1 Drop Stats: 0
RxQueue 2 Drop Stats: 0
RxQueue 3 Drop Stats: 0

Port 0 TxQueue Drop Stats: 20651
Port 1 TxQueue Drop Stats: 411918
Port 2 TxQueue Drop Stats: 0
Port 3 TxQueue Drop Stats: 815
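Comparing the counters by eye across two runs is error-prone, so a small script can do the diff. This is only a sketch: it assumes the output format shown above, the helper names are made up, and the second snapshot is a fabricated example (port 1 bumped by hand), not data from the switch.

```python
import re
from typing import Dict

# Matches lines such as "Port 1 TxQueue Drop Stats: 411918"
TX_DROP = re.compile(r"Port\s+(\d+)\s+TxQueue Drop Stats:\s+(\d+)")


def parse_tx_drops(output: str) -> Dict[int, int]:
    """Map port number -> TxQueue drop counter from the command output."""
    return {int(port): int(count) for port, count in TX_DROP.findall(output)}


def growing_ports(before: str, after: str) -> Dict[int, int]:
    """Return only the ports whose counter rose between two snapshots."""
    old, new = parse_tx_drops(before), parse_tx_drops(after)
    return {p: new[p] - old.get(p, 0) for p in new if new[p] > old.get(p, 0)}


# First snapshot: the TxQueue lines quoted in this post.
first = """
Port 0 TxQueue Drop Stats: 20651
Port 1 TxQueue Drop Stats: 411918
Port 2 TxQueue Drop Stats: 0
Port 3 TxQueue Drop Stats: 815
"""
# Second snapshot: hypothetical, with port 1 increased for illustration only.
second = first.replace("411918", "412000")

print(growing_ports(first, second))
```

Running this during a test download would show whether the drops happen on the port facing the workstation while the transfer is in progress.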


QoS setting for the 3750 stack:

L34AM#sh mls qos
QoS is disabled
QoS ip packet dscp rewrite is enabled


On a random 3560 (= access-switch, we have about twenty of them) I get the result below:

sw41d21#sh platform port-asic stats drop asic 0

Port-asic Port Drop Statistics - Summary
========================================
RxQueue 0 Drop Stats: 0
RxQueue 1 Drop Stats: 0
RxQueue 2 Drop Stats: 0
RxQueue 3 Drop Stats: 0

Port 0 TxQueue Drop Stats: 0
Port 1 TxQueue Drop Stats: 0
Port 2 TxQueue Drop Stats: 103701
Port 3 TxQueue Drop Stats: 135151
Port 4 TxQueue Drop Stats: 0
Port 5 TxQueue Drop Stats: 86154
Port 6 TxQueue Drop Stats: 84042
Port 7 TxQueue Drop Stats: 86836
Port 8 TxQueue Drop Stats: 0
Port 9 TxQueue Drop Stats: 0
Port 10 TxQueue Drop Stats: 78547
Port 11 TxQueue Drop Stats: 240202
Port 12 TxQueue Drop Stats: 0
Port 13 TxQueue Drop Stats: 0
Port 14 TxQueue Drop Stats: 304858
Port 15 TxQueue Drop Stats: 385277
Port 16 TxQueue Drop Stats: 4442
Port 17 TxQueue Drop Stats: 169403
Port 18 TxQueue Drop Stats: 28999
Port 19 TxQueue Drop Stats: 2424
Port 20 TxQueue Drop Stats: 184948
Port 21 TxQueue Drop Stats: 179587
Port 22 TxQueue Drop Stats: 119822
Port 23 TxQueue Drop Stats: 271323
Port 24 TxQueue Drop Stats: 18
Port 25 TxQueue Drop Stats: 120535
(output abbreviated)

sw41d21#sh mls qos
QoS is enabled
QoS ip packet dscp rewrite is enabled


However, QoS is only there to preserve marked packets; no queueing mechanism is used.

Port 25 is likely int gi0/2:

sw41d21#sh queueing int gi0/2
Interface GigabitEthernet0/2 queueing strategy: none

This is a typical port-config, they are all like this:

interface GigabitEthernet0/2
 description Stackport to sw40d21
 switchport trunk encapsulation dot1q
 switchport mode trunk
 mls qos trust cos

And the uptime of this switch:

sw41d21 uptime is 27 weeks, 3 days, 20 hours, 56 minutes


regards,

Leo

royalblues Wed, 03/26/2008 - 04:46

When you enable QoS, the buffers are by default distributed equally across 4 queues, and the traffic from the server might be overflowing one of them.


Is there a way you can just disable QoS for testing? If things improve, we can try tuning the queues later on.


Narayan

Joseph W. Doherty Wed, 03/26/2008 - 05:05

Something else to consider: even though you're not seeing the drop counters increment as you watch, the fact that you have any drops counted at all indicates there has been transient congestion bad enough to overflow the queue. You might still have transient congestion that doesn't overflow the queue.


Why is this important? TCP will only hit full speed when its receive window is at least as large as the bandwidth delay product (BDP). Transient congestion, which causes transient delays, could slow effective TCP transfer rates if the default TCP receive window isn't large enough.
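To make the BDP point concrete, here is a rough calculation. It is a sketch only: the 17520-byte window and the 1 ms / 5 ms round-trip times are illustrative assumptions, not values measured on this network, and the function names are made up.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth delay product: bytes in flight needed to fill the link."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_mbps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on TCP throughput for a given receive window and RTT."""
    return window_bytes * 8 / rtt_s / 1e6


LINK_BPS = 100e6   # Fast Ethernet access port
WINDOW = 17520     # assumed receive window in bytes (illustrative)

for rtt_ms in (1, 5):  # assumed round-trip times, without and with queueing delay
    rtt = rtt_ms / 1000
    print(
        f"RTT {rtt_ms} ms: BDP {bdp_bytes(LINK_BPS, rtt):.0f} bytes, "
        f"ceiling {max_throughput_mbps(WINDOW, rtt):.1f} Mbit/s"
    )
```

With these assumed numbers the window is larger than the BDP at 1 ms, but at 5 ms the ceiling drops to about 28 Mbit/s, well below the 100 Mbit/s line rate. A few milliseconds of queueing delay would be enough to limit the transfer.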

Joseph W. Doherty Wed, 03/26/2008 - 04:15

Many of the recent pre-Vista Windows TCP stacks adjust their TCP receive window based on the Ethernet port speed. There is not enough information to declare this the cause of your issue, but it is something to be aware of.

lgijssel Wed, 03/26/2008 - 05:41

Guys, thanks in part to your input we have reached a breakthrough! Graeme, I completely underrated your first response, which was 100% on target. I will make this up to you.

I had always thought that mls qos was needed to do things like trusting packet markings, which is why it was here in the first place.

We have ample bandwidth, so the more complex challenges of policing and shaping are not needed here at all.

I have now managed to resolve the issue remotely just by removing the line "mls qos" from the config of the 3560 access switches!

The fact is that we still have QoS running on the 6500 core with identical settings. It appears that QoS is handled very differently on different platforms. I knew this was the case, but the implications of having this single line in the config, without any policers or other settings in place, still strike me as vague and weird. It certainly looks like I have some additional reading to do on a subject with which I felt quite comfortable.


Leo

andrew.butterworth Wed, 03/26/2008 - 09:25

What you are seeing is detailed in Bug CSCsc96037. This isn't a bug as such, but the default behaviour when QoS is enabled. Because of the way the hardware works, the default egress queue settings cause drops when the queue thresholds are reached once QoS is enabled. The code was modified in 12.2(25)SEE1 and later to allow more of the common pool buffers to be used by each of the egress queues. You do, however, need to modify the configuration:


mls qos queue-set output 1 threshold X 3200 3200 100 3200


Where 'X' is the queue you want to change.
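If the change has to go onto many switches, the lines can be generated rather than typed. This is only a sketch: the threshold values are copied from the command above, the choice to emit all four egress queues and the function name are assumptions for illustration, and the accepted value ranges should be checked against the IOS release in use.

```python
from typing import List, Sequence


def queue_set_commands(
    queues: Sequence[int] = (1, 2, 3, 4),
    queue_set: int = 1,
    thresholds: Sequence[int] = (3200, 3200, 100, 3200),  # values from the post
) -> List[str]:
    """Build one 'mls qos queue-set output' line per egress queue."""
    values = " ".join(str(t) for t in thresholds)
    return [
        f"mls qos queue-set output {queue_set} threshold {q} {values}"
        for q in queues
    ]


# Print the lines for pasting into global configuration mode.
for line in queue_set_commands():
    print(line)
```

Pass a shorter `queues` tuple, for example `(2,)`, to change only the queue that is actually dropping.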


Andy



lgijssel Wed, 03/26/2008 - 09:36

Thank you, I will check the bug description.

Removing QoS entirely solved the problem for now, but we will need this when QoS has to be turned back on.


Thanks a lot!


Leo
