Ethernet output overruns


Could someone explain what happens to a network flow that encounters an output overrun?


I had a report that a file copy between two servers in the same 6509 chassis was slow over the weekend. It took 14 hours where it usually takes 4. I see that there are 32 overruns with 10 queue drops. It looks like this is a case where the NIC on the server couldn't keep up, but what effect does that have on overall network performance? Does flow control start chopping the window size? Is 32 a high number?


Thanks in advance,

Greg


balajitvk Mon, 07/30/2007 - 07:30

Hi,


Check the interface speed on both the servers and the switch ports. There might be a mismatch.


The number shown under overruns in the interface error counters is the number of times that the receiver hardware was unable to hand received data to a hardware buffer because the input rate exceeded the receiver's ability to handle the data.


Rgs,

Jagdeep Gambhir Mon, 07/30/2007 - 07:31

If the capacity of the interface is exceeded, the frame that is currently being received is dropped and the overrun counter is incremented.


Each network interface on Cisco routers consists of a chipset for converting signals received in the media into bits of information, and a small packet buffer into which this information is stored before being copied into the I/O memory. On some interface types, this chipset and packet buffer cannot handle a long burst of frames. Such interfaces are meant to provide connectivity to a certain network type, and not to switch packets at line rate. The line rate of these interfaces is often higher than the switching capacity of the router. Therefore, building an interface that receives more traffic than the router can handle only increases the cost, without adding any real value to the router architecture.


In a small number of cases, the overrun counter may be incremented because of a software defect. However, in the majority of cases, it indicates that the receiving capability of the interface was exceeded. Nothing can be done on the router that reports overruns. If possible, the rate at which frames arrive should be controlled at the remote end of the connection. Otherwise, if the number of overruns is high, the hardware should be upgraded.
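
To make the mechanism concrete, here is a minimal Python sketch of a small receive buffer being overrun by a burst. The buffer size, drain rate, and arrival pattern are made-up numbers chosen for illustration; they are not values from any Cisco hardware, and the function name is hypothetical.

```python
def simulate_overruns(arrivals_per_tick, buffer_slots, drained_per_tick):
    """Count frames dropped when a fixed-size receive buffer is full.

    arrivals_per_tick: frames arriving in each time slot (hypothetical traffic).
    buffer_slots:      how many frames the interface buffer can hold.
    drained_per_tick:  frames copied out of the buffer to I/O memory per slot.
    """
    buffered = 0
    overruns = 0
    for arriving in arrivals_per_tick:
        # Frames that do not fit in the remaining buffer space are dropped,
        # and each dropped frame increments the overrun counter.
        free = buffer_slots - buffered
        accepted = min(arriving, free)
        overruns += arriving - accepted
        buffered += accepted
        # The router then drains what it can before the next slot.
        buffered -= min(buffered, drained_per_tick)
    return overruns


# Steady traffic at the drain rate never overruns the buffer.
steady = simulate_overruns([2] * 10, buffer_slots=4, drained_per_tick=2)

# A bursty pattern with the same total number of frames overflows it.
bursty = simulate_overruns([0, 0, 0, 0, 10, 0, 0, 0, 0, 10],
                           buffer_slots=4, drained_per_tick=2)
print(steady, bursty)
```

Both patterns carry 20 frames in total, but only the bursty one drops any. That is why an overrun count can climb even when the average load looks modest.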


There are a number of troubleshooting steps you can take to find out what traffic is being processed and backing up the input queue. The section "Input Queue Drops" in the document linked below should give all the steps needed to find this.



http://www.cisco.com/en/US/products/hw/routers/ps133/products_tech_note09186a0080094791.shtml


However, the input queue drops may just be a result of the same bursty traffic that causes the overruns. Some traffic by nature just has to be processed through the input queue, and at those times it may be dropped.



Regards,

~JG

Does that mean that the port input queue received more data than the output queue can handle, and that it has nothing to do with the server's NIC?


We're using 6748-GE-TX line cards so I am surprised to think that the card can't keep up. The 2 servers are in different line cards but are in the same chassis.


Thanks again.

Greg

Edison Ortiz Mon, 07/30/2007 - 08:06

Let's see the output from typing


show mod

and

show fabric



Edison Ortiz Mon, 07/30/2007 - 08:36

Your config is ideal, as you are running in Compact mode and every single module is fabric-enabled.


Have you checked for a speed/duplex mismatch between the switch and the servers?


Is this a new configuration, or was it working before without problems?


Both ports are configured as 1000/Full. The config is pretty new, about 4 weeks old. This data transfer is happening every weekend so I am trying to verify if this has happened recently or if this is the 1st time.


I'm trying to make sure that I understand, though, before sending another email to the Data Center Team. I think I misinformed them earlier because I thought the output overruns were because the NIC on the server couldn't handle any more packets. I am now under the impression that the overruns occur when the switch receives packets on the ingress port faster than it can send them out the egress port. Is this correct?

Edison Ortiz Mon, 07/30/2007 - 10:34

Can I see the output from typing


show int [mod/port] stat

show int [mod/port] status


from the affected ports


and

show version



You bet! (I'm only seeing overruns on gi7/48)


sh int gi3/48 stat
GigabitEthernet3/48
          Switching path    Pkts In   Chars In   Pkts Out  Chars Out
               Processor          0          0      43131   19150099
             Route cache          0          0          0          0
       Distributed cache          0          0          0          0
                   Total          0          0      43131   19150099
m-server-1#sh int gi3/48 status

Port      Name               Status       Vlan       Duplex  Speed Type
Gi3/48                       connected    157        a-full a-1000 10/100/1000BaseT
m-server-1#
m-server-1#
m-server-1#sh int gi7/48 stat
GigabitEthernet7/48
          Switching path    Pkts In   Chars In   Pkts Out  Chars Out
               Processor          0          0      43092   19132728
             Route cache          0          0          0          0
       Distributed cache          0          0          0          0
                   Total          0          0      43092   19132728
m-server-1#sh int gi7/48 status

Port      Name               Status       Vlan       Duplex  Speed Type
Gi7/48                       connected    157        a-full a-1000 10/100/1000BaseT
m-server-1#sh ver
Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-VM), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2006 by cisco Systems, Inc.
Compiled Tue 19-Sep-06 01:38 by tinhuang
Image text-base: 0x01020150, data-base: 0x01021000

ROM: System Bootstrap, Version 12.2(17r)S4, RELEASE SOFTWARE (fc1)
BOOTLDR:
m-server-1 uptime is 4 weeks, 2 days, 1 hour, 46 minutes
Time since m-server-1 switched to active is 4 weeks, 2 days, 1 hour, 46 minutes
System returned to ROM by reload (SP by reload)
System restarted at 12:55:32 EDT Sat Jun 30 2007
System image file is "disk0:/sys/s72033/base/s72033-ipservicesk9_wan-vm"

This product contains cryptographic features and is subject to United
States and local country laws governing import, export, transfer and
use. Delivery of Cisco cryptographic products does not imply
third-party authority to import, export, distribute or use encryption.
Importers, exporters, distributors and users are responsible for
compliance with U.S. and local country laws. By using this product you
agree to comply with applicable laws and regulations. If you are unable
to comply with U.S. and local laws, return this product immediately.

A summary of U.S. laws governing Cisco cryptographic products may be found at:
http://www.cisco.com/wwl/export/crypto/tool/stqrg.html

If you require further assistance please contact us by sending email to
export@cisco.com.

cisco WS-C6509-E (R7000) processor (revision 1.2) with 491520K/32768K bytes of memory.
Processor board ID SMG1044N3D1
SR71000 CPU at 600Mhz, Implementation 1284, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
Bridging software.
X.25 software, Version 3.0.0.
SuperLAT software (copyright 1990 by Meridian Technology Corp).
TN3270 Emulation software.
14 Virtual Ethernet/IEEE 802.3 interfaces
196 Gigabit Ethernet/IEEE 802.3 interfaces
4 Ten Gigabit Ethernet/IEEE 802.3 interfaces
1917K bytes of non-volatile configuration memory.

Configuration register is 0x2102

System is currently running from installed software
For further information use "show install running"
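
For anyone who wants to compare the two ports without eyeballing the output, the two `status` rows above can be checked with a short Python sketch. The helper names are hypothetical, and the parser assumes the Name column is empty, as it is here; a port description containing spaces would need a fixed-width parser instead.

```python
def parse_status_line(line):
    """Split one `show interface status` data row into named fields.

    Assumes the Name column is empty, so the row has exactly six
    whitespace-separated fields.
    """
    port, status, vlan, duplex, speed, media = line.split()
    return {"port": port, "status": status, "vlan": vlan,
            "duplex": duplex, "speed": speed, "type": media}


def settings_match(a, b):
    """True when two ports report the same duplex and speed."""
    return (a["duplex"], a["speed"]) == (b["duplex"], b["speed"])


def is_autonegotiated(field):
    """IOS prefixes a value with 'a-' when it was autonegotiated."""
    return field.startswith("a-")


gi3 = parse_status_line("Gi3/48 connected 157 a-full a-1000 10/100/1000BaseT")
gi7 = parse_status_line("Gi7/48 connected 157 a-full a-1000 10/100/1000BaseT")

print(settings_match(gi3, gi7))
print(is_autonegotiated(gi7["duplex"]), is_autonegotiated(gi7["speed"]))
```

This only compares what the switch reports for its own ports. The NIC settings still have to be checked on each server, since a mismatch would be between a switch port and the NIC attached to it.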

Edison Ortiz Mon, 07/30/2007 - 12:28

Everything looks good at the switch. Since you mentioned it's only happening on one port and the config on both ports is identical, I recommend looking into Layer 1 between the switch and the server. Replace cables as needed and investigate whether this was a one-off or a chronic problem.

Thanks for your help.


Just to clarify, are output overruns caused by the ingress queue sending packets faster than the output queue hardware buffer can handle?


My 1st thought was that the server NIC hardware buffer was full but then I read that the counter was only on the port. If this is so, how would layer 1 affect the performance of the egress queue?


Sorry for such a lame question.


Thanks,

Greg

Edison Ortiz Tue, 07/31/2007 - 07:16

These are packets leaving the switch so they are placed in the egress queue.


It sounds like the packets leaving the switch were too big, or there were too many of them, to fit in some of the queues. That caused the overrun condition, which in turn caused some queue drops.


It seems the source server was sending packets faster than the destination could receive them.


Were both servers running under the same CPU/Load condition when the transfer was taking place ?
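
A rough way to picture the condition described above is a queue that fills whenever packets arrive faster than the port drains them. The Python sketch below is a simple fluid approximation with hypothetical rates and queue depth; real queue sizes depend on the line card and its QoS configuration, and the function name is made up for illustration.

```python
def egress_queue_model(arrival_pps, drain_pps, queue_depth_pkts):
    """Return (seconds until the queue is full, packets dropped per second after that).

    All three inputs are hypothetical values, not measurements from the switch.
    """
    excess = arrival_pps - drain_pps
    if excess <= 0:
        # The port drains at least as fast as packets arrive: no backlog builds.
        return None, 0
    # The backlog grows by `excess` packets each second until the queue is full;
    # after that, every excess packet is dropped.
    return queue_depth_pkts / excess, excess


fill_time, drop_rate = egress_queue_model(
    arrival_pps=90_000, drain_pps=81_000, queue_depth_pkts=450)
print(fill_time, drop_rate)
```

With these example numbers the queue fills in about 50 ms of sustained overload, which suggests that even a short burst could be enough to produce a handful of drops.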




Edison Ortiz Wed, 08/01/2007 - 07:10

Yes.
