6500 packet loss on blade WS-X6748-GE-TX

Unanswered Question
Oct 3rd, 2011

Hi,

We have a 6509 running IOS 12.2(33)SXJ.

We have two WS-X6516-GE-TX modules, a WS-X6516A-GBIC, and a WS-X6748-GE-TX with a WS-F6700-CFC daughter card.

Our supervisor is a WS-SUP720-3B.

We are experiencing packet loss for everything connected to the WS-X6748-GE-TX blade. Right now we don't have any production devices on that blade because of the packet loss.

Has anyone encountered the same problem?

This switch was running hybrid before; it is now running native IOS. However, I can't recall whether we had this packet loss before.

Do I need to update the firmware of the card or daughter card? (If that is even possible; I can't say I've done it before.)

thank you

reseaucsp Tue, 10/04/2011 - 10:45

I read the release notes for 12.2SX.

It seems the ROMMON on the WS-F6700-CFC daughter card was not up to date. I updated it to 12.2(18r)S1 as the release notes suggested; however, it did not resolve my problem. I'm still experiencing packet loss for devices connected to this blade.

Right now the blade is in slot 9 of our 6509. I could move it to slot 1, 2, or 3. Would that change anything?

thank you

Jon Marshall Tue, 10/04/2011 - 11:35

Sylvain

The 6748 module has 2 x 20Gbps connections to the switch fabric. It has 48 10/100/1000Mbps ports. So in theory you can oversubscribe this module, but you would need more than 40 ports, or more specifically more than 20 ports per port group, to be transmitting 1Gbps simultaneously, which is unlikely.

Just to clarify the port group thing. The 6748 has 2 port groups -

group1 = ports 1 - 24

group2 = ports 25 - 48

each port group has access to a 20Gbps connection to the switch fabric.

So if you have more than 20 connected devices per port group transmitting 1Gbps each simultaneously, then you do have oversubscription. But as I say, this is highly unlikely.

Moving the module to a different slot in the 6509 should make no difference, as each slot provides a maximum of 40Gbps.
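If you want to rule oversubscription in or out, the supervisor can report fabric channel load per slot directly. A quick check (assuming a Sup720-class supervisor; exact output format varies by IOS release):

```
! Per-slot ingress/egress fabric channel utilization
show fabric utilization all

! Five-minute rates and output drops on a suspect port
show interfaces gigabitEthernet 9/1 | include rate|output drops
```

If the fabric channels for slot 9 sit near idle while the drops climb, oversubscription is effectively ruled out.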

Is there any possibility you have enabled QoS but not tuned the buffers accordingly? Where is the packet loss, i.e. ingress to the ports or egress from the ports?

Jon

Leo Laohoo Tue, 10/04/2011 - 14:33

1.  Post the output of "sh interface", and what is the uptime of the chassis?

2.  Can you also post "sh interface count error" please?

reseaucsp Wed, 10/05/2011 - 12:02

The QoS could be the problem, I guess. Before, the command "mls qos" was configured.

While this command was on the switch we experienced packet loss and delay on our pings.

Then we disabled this command; we still had packet loss but no longer had the delay.

Is there a document that could help us configure QoS for this blade?

Here's a "show interface".

We have the problem on all ports of the 6748.

As for the uptime, the 6500 was updated this weekend, so about 4 days.

thank you

GigabitEthernet9/1 is up, line protocol is up (connected)

  Hardware is C6k 1000Mb 802.3, address is 0016.c810.75c0 (bia 0016.c810.75c0)

  Description:

  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,

     reliability 255/255, txload 1/255, rxload 1/255

  Encapsulation ARPA, loopback not set

  Keepalive set (10 sec)

  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT

  input flow-control is off, output flow-control is on

  Clock mode is auto

  ARP type: ARPA, ARP Timeout 04:00:00

  Last input never, output 00:00:38, output hang never

  Last clearing of "show interface" counters never

  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 5

  Queueing strategy: fifo

  Output queue: 0/40 (size/max)

  5 minute input rate 0 bits/sec, 0 packets/sec

  5 minute output rate 51000 bits/sec, 14 packets/sec

     185714 packets input, 64272078 bytes, 0 no buffer

     Received 2873 broadcasts (0 multicasts)

     0 runts, 0 giants, 0 throttles

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

     0 watchdog, 0 multicast, 0 pause input

     0 input packets with dribble condition detected

     4769209 packets output, 2225784348 bytes, 0 underruns

     0 output errors, 0 collisions, 4 interface resets

     0 babbles, 0 late collision, 0 deferred

     0 lost carrier, 0 no carrier, 0 PAUSE output

     0 output buffer failures, 0 output buffers swapped out

show interfaces gigabitEthernet 9/1 counters errors

Port                     Align-Err             FCS-Err            Xmit-Err             Rcv-Err           UnderSize         OutDiscards

Gi9/1                            0                   0                   0                   0                   0                   2

Port                    Single-Col           Multi-Col            Late-Col          Excess-Col           Carri-Sen               Runts              Giants

Gi9/1                            0                   0                   0                   0                   0                   0                   0

Port                   SQETest-Err         Deferred-Tx        IntMacTx-Err        IntMacRx-Err          Symbol-Err

Gi9/1                            0                   0                   0                   0                   0

johnnylingo Fri, 02/10/2012 - 00:16

Also seeing output drops, but mine are to a Linux server on a WS-X6748-GE-TX blade.  The drops occur when the server reads from a NAS, which has a 10Gb connection. Unfortunately, the application does a poor job of handling the drops and does not support rate limiting.  Also, both the server and NAS are on the same subnet, so implementing Layer 3 QoS is not an option.

Is there a good work-around for this scenario?  Would flow control help?  Or should I look into increasing the buffer size of queue #1?

#show int gig7/27

  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 261639

# show queueing int gig7/27

Packets dropped on Transmit:
    BPDU packets:  0

    queue              dropped  [cos-map]
    ---------------------------------------------
    1                   261639  [0 1 ]
    2                        0  [2 3 4 ]
    3                        0  [6 7 ]
    4                        0  [5 ]
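For reference, these are the knobs I'm weighing (a sketch only - I haven't applied this; the 6748 transmit side is 1p3q8t, and queue-limit and flowcontrol behavior depend on the module and on whether "mls qos" is enabled):

```
interface GigabitEthernet7/27
 ! Let this port send pause frames when its receive buffers fill,
 ! and honor pause frames from the server
 flowcontrol send on
 flowcontrol receive on
 ! Shift more transmit buffer toward standard queue 1, where my
 ! drops are occurring (percentages for the 3 standard queues)
 wrr-queue queue-limit 70 15 15
```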

alexkosykh Thu, 11/22/2012 - 20:57

We have the same problem on a 6509. IOS s72033-advipservicesk9_wan-mz.122-33.SXH2.

Sylvain, did you solve the problem?

nkarpysh Thu, 11/22/2012 - 21:39

Hello Gents,

There are a few possible reasons for this kind of problem:

- Pure oversubscription - several ports, or a higher-speed port, sending traffic out of a single lower-speed port. The line won't be able to send it all and starts to drop.

- QoS tuning is not efficient.

- The remote side is sending flow-control pause frames because it can't handle traffic that fast.

- A hardware problem.

- etc.

I would recommend starting with the first one. If you suspect drops, first understand what traffic is coming out of that port and where it comes into the switch. Check whether oversubscription is happening. Keep in mind the module architecture and its internal oversubscription limits. Check output drops on the interface with the "show int" command.

For the second point - if you suspect QoS, try disabling QoS globally first during a maintenance window and see if that improves the situation. If yes, you can then troubleshoot QoS further:

http://www.cisco.com/en/US/partner/products/hw/switches/ps708/products_tech_note09186a008074d6b1.shtml
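A minimal sketch of that test (note: "no mls qos" changes queueing behavior on every port, hence the maintenance window):

```
conf t
 no mls qos
end
! Confirm QoS is now disabled globally
show mls qos
! Then watch whether output drops keep incrementing
show interfaces gigabitEthernet 9/1 | include output drops
```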

3rd - please check "show int" and see if the pause counter is incrementing. If yes, check the problem on the remote side.
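For example (the interface name here is just a placeholder):

```
show interfaces gigabitEthernet 9/1 | include pause
show interfaces gigabitEthernet 9/1 counters errors
```

A climbing "pause input" counter means the remote device is asking the switch to slow down.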

4th - try moving the link to other ports on the same line card, to a different ASIC on the same line card, and to a different line card, and notice how the drops behave. You can make good decisions based on that.

Please don't hesitate to open a TAC case for this kind of problem to verify it in more detail. Each situation might be very different, so a common approach does not work well for all.

Nik

reseaucsp Fri, 11/23/2012 - 06:33

We solved our problem.

For us it appeared to be a hardware problem; we contacted TAC and they replaced the blade, no problem.

We have not experienced the problem since.

alexkosykh Fri, 11/23/2012 - 02:28

Today we moved our links from 1-20 to 25-44 ports. It works!!!

We tested the ports with only one link at a time from a PC to the blade: just moving from port to port and pinging from the PC to the Cisco and vice versa. On ports 1 to 24 we saw packet loss.

#ping 192.168.15.150

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.15.150, timeout is 2 seconds:

!.!!.

Success rate is 60 percent (3/5), round-trip min/avg/max = 1/2/4 ms

Ports 25 to 48 work fine, without loss.

I saw in tcpdump that all pings from the Cisco reach the PC, but the Cisco didn't see the replies from the PC.

13:42:09.919744 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 0, length 80

13:42:09.919758 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 0, length 80

13:42:09.921342 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 1, length 80

13:42:09.921349 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 1, length 80

13:42:11.920571 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 2, length 80

13:42:11.920582 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 2, length 80

13:42:11.921051 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 3, length 80

13:42:11.921058 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 3, length 80

13:42:11.921456 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 4, length 80

13:42:11.921462 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 4, length 80

This is my test config

interface Vlan7

description 6748 test

ip address 192.168.15.149 255.255.255.252

end

interface GigabitEthernet9/23

description test

switchport

switchport access vlan 7

switchport mode access

spanning-tree portfast

end

Mod Ports Card Type                              Model

--- ----- -------------------------------------- ------------------

1   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC

2   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC

3   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC

4   16  16 port 1000mb MTRJ ethernet           WS-X6416-GE-MT

5    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE

6    2  Supervisor Engine 720 (Active)         WS-SUP720-3B

7   24  24 port 100FX Multi mode               WS-X6324-100FX-MM

8   48  SFM-capable 48 port 10/100/1000mb RJ45 WS-X6548-GE-TX

9   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX

Why do ports 1 to 24 work with packet loss?

nkarpysh Mon, 11/26/2012 - 19:24

Hi Alexander,

The problem might be related to load on the ASICs corresponding to ports 1-24. Some of the other links may already be carrying traffic at line speed. Oversubscription on this module is 1.2:1 (48 ports x 1Gbps = 48Gbps feeding 40Gbps of fabric), meaning 12 ports share a 10G fabric path. So if they all send traffic at line rate, you will have drops.

Also, nothing excludes a bad port or NIC - so check whether moving the link to some other port within the first 24 also solves the problem. That would point to a hardware problem on a single port or group of ports and their ASIC (Rohini, or the Janus ASIC for a group of 24 ports).

Nik
