6500 packet loss on blade WS-X6748-GE-TX

Oct 3rd, 2011

Hi,


We have a 6509 running IOS 12.2(33)SXJ.


We have two WS-X6516-GE-TX modules, a WS-X6516A-GBIC, and a WS-X6748-GE-TX with a WS-F6700-CFC daughtercard.


Our supervisor is a WS-SUP720-3B.


We are experiencing packet loss for everything connected to the WS-X6748-GE-TX blade. Right now we don't have any production devices on that blade because of the packet loss.


Has anyone encountered the same problem?


This switch was running hybrid before; it is now running native IOS. However, I can't recall whether we had this packet loss before.


Do I need to update the firmware on the card or the daughtercard? (If that's even possible; I can't say I've done it before.)


thank you

Sylvain Deschenes Tue, 10/04/2011 - 10:45

I read the release notes for 12.2SX.


It seems the ROMMON for the WS-F6700-CFC daughtercard was not up to date. I updated it to 12.2(18r)S1 as the release notes suggested; however, that did not resolve my problem. I'm still experiencing packet loss for devices connected to this blade.


Right now the blade is in slot 9 of our 6509. I could move it to slot 1, 2, or 3. Would that change anything?


thank you

Jon Marshall Tue, 10/04/2011 - 11:35

Sylvain


The 6748 module has 2 x 20Gbps connections to the switch fabric and 48 10/100/1000Mbps ports. So in theory you can oversubscribe this module, but it is unlikely: you would need more than 40 ports, or more specifically more than 20 ports per port group, transmitting 1Gbps simultaneously.


Just to clarify the port groups. The 6748 has 2 port groups -


group1 = ports 1 - 24

group2 = ports 25 - 48


each port group has access to a 20Gbps connection to the switch fabric.


So if you have more than 20 connected devices per port group, each transmitting 1Gbps simultaneously, then you do have oversubscription. But as I say, this is highly unlikely.
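Jon's arithmetic above can be sketched in a few lines of Python (a toy model of the fabric math, not anything Cisco ships; the constants are taken straight from his description):

```python
# Toy model of the WS-X6748-GE-TX oversubscription math:
# two port groups of 24 ports, each sharing one 20 Gbps fabric channel.
FABRIC_CHANNEL_GBPS = 20
PORT_SPEED_GBPS = 1
PORTS_PER_GROUP = 24

def group_oversubscribed(active_ports, per_port_gbps=PORT_SPEED_GBPS):
    """True if simultaneous line-rate senders exceed the 20 Gbps channel."""
    return active_ports * per_port_gbps > FABRIC_CHANNEL_GBPS

print(group_oversubscribed(20))  # 20 senders at 1 Gbps fit: False
print(group_oversubscribed(21))  # 21 do not: True
```

This is why Jon calls real oversubscription unlikely: you need 21+ ports in the same group all transmitting at line rate at the same moment.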


Moving the module to a different slot in the 6509 should make no difference, as each slot provides a maximum of 40Gbps.


Is there any possibility you have enabled QoS but not tuned the buffers accordingly? Where is the packet loss, i.e. ingress to the ports or egress from the ports?


Jon

Leo Laohoo Tue, 10/04/2011 - 14:33

1.  Post the output of "sh interface". What is the uptime of the chassis?

2.  Can you also post "sh interface counters errors" please?

Sylvain Deschenes Wed, 10/05/2011 - 12:02

The QoS could be the problem, I guess. Before, the command "mls qos" was configured.


While this command was on the switch we experienced packet loss and delay in our pings.


Then we removed the command; we still had packet loss, but no longer had the delay.


Is there a document that could help us configure QoS for this blade?


Here's a show interface.


We have the problem on all ports of the 6748.


As for uptime, the 6500 was updated this weekend, so about 4 days.


thank you



GigabitEthernet9/1 is up, line protocol is up (connected)

  Hardware is C6k 1000Mb 802.3, address is 0016.c810.75c0 (bia 0016.c810.75c0)

  Description:

  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,

     reliability 255/255, txload 1/255, rxload 1/255

  Encapsulation ARPA, loopback not set

  Keepalive set (10 sec)

  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT

  input flow-control is off, output flow-control is on

  Clock mode is auto

  ARP type: ARPA, ARP Timeout 04:00:00

  Last input never, output 00:00:38, output hang never

  Last clearing of "show interface" counters never

  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 5

  Queueing strategy: fifo

  Output queue: 0/40 (size/max)

  5 minute input rate 0 bits/sec, 0 packets/sec

  5 minute output rate 51000 bits/sec, 14 packets/sec

     185714 packets input, 64272078 bytes, 0 no buffer

     Received 2873 broadcasts (0 multicasts)

     0 runts, 0 giants, 0 throttles

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

     0 watchdog, 0 multicast, 0 pause input

     0 input packets with dribble condition detected

     4769209 packets output, 2225784348 bytes, 0 underruns

     0 output errors, 0 collisions, 4 interface resets

     0 babbles, 0 late collision, 0 deferred

     0 lost carrier, 0 no carrier, 0 PAUSE output

     0 output buffer failures, 0 output buffers swapped out


show interfaces gigabitEthernet 9/1 counters errors



Port                     Align-Err             FCS-Err            Xmit-Err             Rcv-Err           UnderSize         OutDiscards

Gi9/1                            0                   0                   0                   0                   0                   2



Port                    Single-Col           Multi-Col            Late-Col          Excess-Col           Carri-Sen               Runts              Giants

Gi9/1                            0                   0                   0                   0                   0                   0                   0



Port                   SQETest-Err         Deferred-Tx        IntMacTx-Err        IntMacRx-Err          Symbol-Err

Gi9/1                            0                   0                   0                   0                   0

johnnylingo Fri, 02/10/2012 - 00:16

Also seeing output drops, but mine are to a Linux server on a WS-X6748-GE-TX blade.  The drops occur when the server reads from a NAS, which has a 10Gb connection. Unfortunately, the application does a poor job of handling the drops and does not support rate limiting.  Also, both the server and NAS are on the same subnet, so implementing Layer 3 QoS is not an option.

Is there a good work-around for this scenario?  Would flow control help?  Or should I look into increasing the buffer sizes of Queue #1?

#show int gig7/27

  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 261639

# show queueing int gig7/27

Packets dropped on Transmit:
    BPDU packets:  0

    queue              dropped  [cos-map]
    ---------------------------------------------
    1                   261639  [0 1 ]
    2                        0  [2 3 4 ]
    3                        0  [6 7 ]
    4                        0  [5 ]

alexkosykh Thu, 11/22/2012 - 20:57

We have the same problem on a 6509, IOS s72033-advipservicesk9_wan-mz.122-33.SXH2.

Sylvain, did you solve the problem?

nkarpysh Thu, 11/22/2012 - 21:39

Hello Gents,


There are a few possible reasons for this kind of problem:


- Pure oversubscription - several ports, or a higher-speed port, are sending traffic out of a single lower-speed port. The line can't carry it all and starts to drop.

- QoS tuning is not efficient.

- The remote side is sending flow-control pause frames because it can't handle traffic that fast.

- A hardware problem.

- etc.


I would recommend starting with the first one. If you suspect drops, first understand what traffic is going out of that port and where it enters the switch. Check whether oversubscription is happening, keeping in mind the module architecture and its internal oversubscription limits. Check output drops on the interface with the "show int" command.


For the second point: if you suspect QoS, try disabling it globally during a maintenance window and see if that improves the situation; if it does, you can troubleshoot QoS further:

http://www.cisco.com/en/US/partner/products/hw/switches/ps708/products_tech_note09186a008074d6b1.shtml


Third, check "show int" and see whether the pause counter is incrementing; if it is, investigate the problem on the remote side.


Fourth, try moving the link between ports on the same line card, to a different ASIC on the same line card, and to a different line card, and note how the drops behave. You can draw good conclusions from that.
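That move-the-link sequence can be written down as a small decision helper (purely illustrative; the function and domain names are made up, not Cisco tooling):

```python
# Encode Nik's isolation test: record whether drops persist at each
# position, from narrowest (another port on the same ASIC) to widest
# (a different line card), then infer the likely fault domain.
def fault_domain(same_asic_drops, other_asic_drops, other_lc_drops):
    """Each argument: True if drops were still seen at that test position."""
    if other_lc_drops:
        return "chassis-wide (e.g. QoS config or the remote side)"
    if other_asic_drops:
        return "line-card-wide"
    if same_asic_drops:
        return "single ASIC / port group"
    return "single port"

print(fault_domain(True, False, False))  # drops stop once you leave the ASIC
```

Here, drops that survive a move to another line card point away from the card itself, while drops confined to one ASIC's ports point at that port group's hardware.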



Please don't hesitate to open a TAC case for this kind of problem to verify it in more detail. Each situation can be very different, so a common approach does not work well for all.


Nik

Sylvain Deschenes Fri, 11/23/2012 - 06:33

We solved our problem.


For us it appeared to be a hardware problem; we contacted TAC and they replaced the module, no problem.


We have not experienced the problem since.

alexkosykh Fri, 11/23/2012 - 02:28

Today we moved our links from ports 1-20 to ports 25-44. It works!!!

We tested with only one link at a time from a PC to the blade, just moving from port to port and pinging from the PC to the Cisco and vice versa. On ports 1 to 24 we saw packet loss.


#ping 192.168.15.150


Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.15.150, timeout is 2 seconds:

!.!!.

Success rate is 60 percent (3/5), round-trip min/avg/max = 1/2/4 ms
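As an aside, the success-rate line can be recomputed from the punctuation IOS prints, where "!" is a reply and "." a timeout (a toy illustration, not Cisco code):

```python
def ping_success_rate(pattern):
    """Percentage of '!' replies in an IOS-style ping result string."""
    replies = pattern.count("!")
    return 100 * replies // len(pattern)

print(ping_success_rate("!.!!."))  # 60, matching the "3/5" above
```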


Ports 25 to 48 work fine without loss.

I could see in tcpdump that all pings from the Cisco reached the PC, but the Cisco didn't see the answers from the PC.


13:42:09.919744 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 0, length 80

13:42:09.919758 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 0, length 80

13:42:09.921342 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 1, length 80

13:42:09.921349 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 1, length 80

13:42:11.920571 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 2, length 80

13:42:11.920582 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 2, length 80

13:42:11.921051 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 3, length 80

13:42:11.921058 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 3, length 80

13:42:11.921456 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 4, length 80

13:42:11.921462 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 4, length 80


This is my test config:


interface Vlan7

description 6748 test

ip address 192.168.15.149 255.255.255.252

end


interface GigabitEthernet9/23

description test

switchport

switchport access vlan 7

switchport mode access

spanning-tree portfast

end



Mod Ports Card Type                              Model

--- ----- -------------------------------------- ------------------

1   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC

2   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC

3   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC

4   16  16 port 1000mb MTRJ ethernet           WS-X6416-GE-MT

5    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE

6    2  Supervisor Engine 720 (Active)         WS-SUP720-3B

7   24  24 port 100FX Multi mode               WS-X6324-100FX-MM

8   48  SFM-capable 48 port 10/100/1000mb RJ45 WS-X6548-GE-TX

9   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX



Why do ports 1 to 24 show packet loss?

nkarpysh Mon, 11/26/2012 - 19:24

Hi Alexander,


The problem might be related to load on the ASICs serving ports 1-24. Some of the other links may already be carrying traffic at line speed. Oversubscription on this module is 1.2:1, meaning 12 ports share a 10G ASIC path, so if all of them send traffic at line rate you will see drops.
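Nik's grouping can be sketched as a port-to-group mapping (an assumption drawn only from his description - 12 ports per 10G ASIC path, 24 ports per 20 Gbps fabric channel - not taken from Cisco documentation):

```python
PORTS_PER_ASIC_PATH = 12       # assumed from "12 ports sharing 10G ASIC"
PORTS_PER_FABRIC_CHANNEL = 24  # one 20 Gbps channel per half of the module

def port_groups(port):
    """Return (asic_path, fabric_channel) for a 1-based WS-X6748 port."""
    if not 1 <= port <= 48:
        raise ValueError("WS-X6748 has ports 1-48")
    return ((port - 1) // PORTS_PER_ASIC_PATH + 1,
            (port - 1) // PORTS_PER_FABRIC_CHANNEL + 1)

# Alexander's lossy ports 1-24 all sit on fabric channel 1,
# while the working ports 25-48 sit on channel 2:
print(port_groups(23))  # (2, 1)
print(port_groups(25))  # (3, 2)
```

Under that assumption, a fault confined to ports 1-24 but spanning both ASIC paths 1 and 2 would point at the shared fabric channel (or its ASIC) rather than a single port.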


Also, a bad port or NIC can't be excluded, so see whether moving the link to some other port within the first 24 also solves the problem. That would point to a hardware problem on a single port, or on a group of ports and their ASIC (the Rohini ASIC, or the Janus ASIC for a group of 24 ports).


Nik
