Nexus 5548 FCoE and QLogic 8100

jhowell02
Level 1

Hello, I have a SPARC T4 running Solaris 11 that is doing 10Gb Ethernet and FCoE over a QLogic 8142 CNA. The SAN is connected to a Nexus 5548 and so is the CNA. I've reviewed the configuration, and both the switch and the card seem to be configured for PFC on FCoE, but for some reason when I examine the traffic from the card, I see that the pause frames are pausing traffic on all priorities 0-7 and not just priority 3. This is wrecking the throughput on the server for anything that is I/O intensive on both the disks and the network (e.g. backups). I really need to get it fixed but am not entirely sure where to start. The guy who runs the switch swears it is configured correctly and provided me this configuration:

    

version 5.0(2)N1(1)

class-map type qos class-fcoe

class-map type qos match-any class-default

class-map type queuing class-fcoe

  match qos-group 1

class-map type queuing class-default

  match qos-group 0

policy-map type qos default-in-policy

  class class-fcoe

    set qos-group 1

  class class-default

    set qos-group 0

policy-map type queuing default-in-policy

  class type queuing class-fcoe

  class type queuing class-default

policy-map type queuing default-out-policy

  class type queuing class-fcoe

  class type queuing class-default

class-map type network-qos class-fcoe

  match qos-group 1

class-map type network-qos class-default

  match qos-group 0

policy-map type network-qos jumbo

  class type network-qos class-fcoe

    pause no-drop

    mtu 2158

  class type network-qos class-default

    mtu 9216

policy-map type network-qos default-nq-policy

  class type network-qos class-fcoe

    pause no-drop

    mtu 2158

  class type network-qos class-default

system qos

  service-policy type queuing input default-in-policy

  service-policy type queuing output default-out-policy

  service-policy type qos input default-in-policy

  service-policy type network-qos jumbo

  fex queue-limit

            

This differs slightly from what the QLogic document says, but I understand conceptually why it should work... It just doesn't. They think it's my card, but if DCBX negotiates the settings, I don't understand how I could possibly change anything. I can see the TLV, and it says that PFC is on for priority 3 and off for everything else. I can also see that traffic is getting tagged by priority. It's just the pause frames, which apply evenly to all priorities. When I watch the pause counters increment, it's always the same number for priorities 0-7.

Anyway, I attached a picture of the traffic that I see from the card side. Maybe I am misinterpreting it, so I'd be glad for another set of eyes.

Any ideas are greatly appreciated. Thanks!

12 Replies

Amit Singh
Cisco Employee

Hi Jessie,

This could be a good start.

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/troubleshooting/guide/n5K_ts_fcoe.html#wp1026358

I don't have much experience with this, but I am happy to troubleshoot it further and see if we can pull it off together.

What type of connectivity do you have to the storage, FC or FCoE? How many links are there, and in the case of FC, what is the FC bandwidth? Is this the only server, or do you have more servers connected to the Nexus 5500?

Cheers,

-amit singh

Thank you for your response Amit.

The connection from the 5k to the storage is native Fibre Channel, and the connection from the 5k to the server is converged 10Gb Ethernet/FCoE.

The SAN is a Sun StorageTek 9990v and it has 4Gb/sec Fibre Channel interfaces.

Every server that has 10Gb Ethernet is connected to the Nexus, so there are probably 30ish other servers connected to it. As far as I know, though, I am the only one who uses a CNA with 10Gb Ethernet and FCoE on the same optical cable.

I am working on getting the network guy to examine the DCBX and pause frame information on the switch. He seems reluctant to engage Cisco support but I have to imagine they've seen this symptom before. Thanks again!

I also found this out about the Nexus 5K. FCoE is classified into a class called "qos-group 1", and the FCoE frames are marked with CoS value 3. Priority flow control is designed only for class-fcoe frames, so when a PFC pause frame is sent out it is also marked with CoS 3 and leaves from qos-group 3 on the switch. If the pause is seen on many or all 8 configured groups on the server side, then you should probably check the server-side settings for any relation between those groups and CoS 3.

My suggestion would be not to concentrate much on Pause frames but more on congestion between 5K and Storage ports.

Hope this helps.

The problem with CNAs is that there are no server-side settings. In fact, the initial communication is done between the firmware loaded on the CNA and the switch. The settings are negotiated via the DCBX protocol. At the OS level, I see a standard HBA and a standard NIC.

Regarding the pause frames, I should probably clarify: because they are being applied to all priorities, both the Ethernet and FCoE performance suffer, since all traffic is pausing rather than just FCoE. Take a backup, for instance, where data is being read off of LUNs and then sent to the backup server via Ethernet. The performance slows to a crawl because it's all constantly pausing. This isn't supposed to happen if PFC is working correctly: Ethernet traffic is not in the priority 3 class of service, so it should drop packets if it needs to rather than pause.
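For anyone who wants to check this distinction in a packet capture, here is a minimal Python sketch, not anything from QLogic or Cisco tooling. It assumes you have the raw bytes of an untagged MAC control frame (EtherType 0x8808) and tells a link-level 802.3x PAUSE (opcode 0x0001, which stops every priority) apart from an 802.1Qbb PFC frame (opcode 0x0101, which carries a per-priority enable vector). The function name, the `build_frame` helper, and the source MAC are made up for illustration.

```python
import struct

MAC_CONTROL_ETHERTYPE = 0x8808
OPCODE_PAUSE = 0x0001  # IEEE 802.3x link-level PAUSE: stops the whole link
OPCODE_PFC = 0x0101    # IEEE 802.1Qbb PFC: stops only the flagged priorities


def describe_mac_control(frame: bytes) -> dict:
    """Report which priorities an untagged MAC control frame pauses."""
    if len(frame) < 18:
        raise ValueError("frame too short to be a MAC control frame")
    ethertype, opcode = struct.unpack_from("!HH", frame, 12)
    if ethertype != MAC_CONTROL_ETHERTYPE:
        return {"kind": "other", "paused": []}
    if opcode == OPCODE_PAUSE:
        # A single timer applies to all traffic on the link.
        (quanta,) = struct.unpack_from("!H", frame, 16)
        return {"kind": "link-level PAUSE",
                "paused": list(range(8)) if quanta else []}
    if opcode == OPCODE_PFC:
        if len(frame) < 34:
            raise ValueError("PFC frame is missing its timer fields")
        # Bit n of the enable vector says whether timer n is valid.
        (vector,) = struct.unpack_from("!H", frame, 16)
        timers = struct.unpack_from("!8H", frame, 18)
        paused = [p for p in range(8) if (vector >> p) & 1 and timers[p] > 0]
        return {"kind": "PFC", "paused": paused}
    return {"kind": "other MAC control", "paused": []}


def build_frame(opcode: int, payload: bytes) -> bytes:
    """Hypothetical helper: assemble a test frame (no padding, no FCS)."""
    dst = bytes.fromhex("0180c2000001")  # MAC control multicast address
    src = bytes.fromhex("020000000001")  # made-up source address
    return dst + src + struct.pack("!HH", MAC_CONTROL_ETHERTYPE, opcode) + payload


# What a healthy FCoE setup should send: PFC for priority 3 only.
PFC_PRIO3 = build_frame(
    OPCODE_PFC, struct.pack("!H8H", 0x0008, 0, 0, 0, 0xFFFF, 0, 0, 0, 0))
# What standard flow control sends: one timer for the whole link.
LINK_PAUSE = build_frame(OPCODE_PAUSE, struct.pack("!H", 0xFFFF))

if __name__ == "__main__":
    print(describe_mac_control(PFC_PRIO3))
    print(describe_mac_control(LINK_PAUSE))
```

If the frames on the wire decode as link-level PAUSE, or as PFC with every bit of the enable vector set, that would match the symptom of identical pause counts on priorities 0-7.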

Hi Jessie,

That's why I mentioned that it's probably the CNA sending pause frames for all classes. The Nexus 5K will only set the default CoS value of 3 for FCoE and will only send FCoE pause frames in that class.

Did you get a chance to speak with QLogic or Oracle (it's an Oracle re-branded CNA) about how to disable pause frames for unwanted classes, or enable pause frames only on CoS 3?

I have looked into some TAC cases on this. One of the customers had the same hardware config as yours, using the QLogic CNA, and was hitting the same symptoms. TAC ended up advising the customer to check the CNA settings with QLogic to enable pause frames only on CoS 3 and disable the others. They also advised the customer to increase the FC bandwidth to the storage. The customer ended up adding additional FC links between the 5K and the storage to alleviate the issue.

Cheers,

-amit singh

Check if the interface towards the server has

flowcontrol receive on

flowcontrol send on

or something like that configured.

The following documentation seems to suggest that this removes DCB behaviour from the CNA:

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/troubleshooting/guide/n5K_ts_fcoe.html#wp1026403

Thanks Amit for all of your help it is appreciated.

The CNA itself is not sending pause frames. The CNA has plenty of bandwidth and does not want to slow down. The SAN will run low on B2B credits under load (as expected) and send pause frames because it needs to slow down. Normally these should only be sent for priority 3 as per PFC, but as we've been discussing they are instead being applied to all priorities, which kicks off a sort of negative feedback loop resulting in a dismal throughput rate.

gnijs, thank you as well. While I haven't seen for myself whether standard flow control is on, the switch guy has told me it is not. I also think that, since I can see that DCB TLVs have been exchanged with the card, PFC should be on.

What I was really hoping to find out was whether anyone had dealt with these cards somehow needing more than just the default configuration to operate properly. The QLogic guide suggests having more than just the default two qos groups: adding another class, class-nic, for only priority 0 traffic, giving it 50% of the queuing bandwidth and making it no-drop as well, and then dropping class-default to 0% of the queuing bandwidth. It doesn't really elaborate on whether or not it has to be set up that way, but I'm just wondering if maybe this particular CNA needs at least 2 no-drop priorities for PFC to work properly on it?

Thanks again guys.

Hi Jessie,

Where did you get that info from? I went to the QLogic website and even the Oracle website but could not find much except the datasheet. I did not read the datasheet completely, though. Let me know if it's in the datasheet, and I will look for more info on this and see if it makes sense. You can also let me know if there is a different document you are reading. I will discuss this with my peers and see what makes sense.

Cheers,

-amit singh

http://filedownloads.qlogic.com/files/Manual/77172/Install_Guide_Unified_Fabric_Pilot.pdf

This is the guide that QLogic referred us to. The part I am referencing is page 24 (section 4-6).

Comparing their config to the config in my original post, it looks a bit different.

QLogic recommends:

class-map type qos class-nic

  match cos 0

class-map type qos class-fcoe

  match cos 3

class-map type queuing class-nic

  match qos-group 2

class-map type network-qos class-nic

  match qos-group 2

policy-map type qos policy1

  class type qos class-nic

    set qos-group 2

policy-map type queuing policy1

  class type queuing class-nic

    bandwidth percent 50

  class type queuing class-fcoe

    bandwidth percent 50

  class type queuing class-default

    bandwidth percent 0

policy-map type network-qos policy1

  class type network-qos class-nic

    pause no-drop

system qos

  service-policy type qos input policy1

  service-policy type queuing output policy1

  service-policy type queuing input policy1

  service-policy type network-qos policy1
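To compare a policy like this against the one in the original post without reading it line by line, a rough Python sketch such as the following can pull out the two things that matter here: the bandwidth percentages in each queuing policy-map and the classes marked `pause no-drop` in each network-qos policy-map. This is only an illustration under simple assumptions: it treats the config as plain text, does not rely on indentation (the pasted configs do not preserve it consistently), and the function name `summarize_qos` is made up. It is not a real NX-OS parser.

```python
# The policy-map portion of the QLogic-recommended config quoted above.
QLOGIC_POLICY = """\
policy-map type qos policy1
class type qos class-nic
set qos-group 2
policy-map type queuing policy1
class type queuing class-nic
bandwidth percent 50
class type queuing class-fcoe
bandwidth percent 50
class type queuing class-default
bandwidth percent 0
policy-map type network-qos policy1
class type network-qos class-nic
pause no-drop
system qos
service-policy type network-qos policy1
"""


def summarize_qos(config: str) -> dict:
    """Collect queuing bandwidth and no-drop classes per policy-map."""
    bandwidth = {}  # policy name -> {class name: percent}
    no_drop = {}    # policy name -> [class names]
    policy = None   # (type, name) of the policy-map being read
    cls = None      # class currently being read inside that policy-map
    for raw in config.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "policy-map" and len(words) == 4 and words[1] == "type":
            policy, cls = (words[2], words[3]), None
        elif words[0] in ("class-map", "system"):
            # Leaving policy-map context.
            policy, cls = None, None
        elif words[0] == "class" and policy:
            cls = words[-1]
        elif policy and cls and policy[0] == "queuing" \
                and words[:2] == ["bandwidth", "percent"] and len(words) == 3:
            bandwidth.setdefault(policy[1], {})[cls] = int(words[2])
        elif policy and cls and policy[0] == "network-qos" \
                and words == ["pause", "no-drop"]:
            no_drop.setdefault(policy[1], []).append(cls)
    return {"bandwidth": bandwidth, "no_drop": no_drop}


if __name__ == "__main__":
    summary = summarize_qos(QLOGIC_POLICY)
    for name, shares in summary["bandwidth"].items():
        print(name, shares, "total:", sum(shares.values()))
    print("no-drop classes:", summary["no_drop"])
```

Run against the excerpt as quoted, this reports `pause no-drop` only under class-nic in policy1, because the quoted excerpt does not restate the class-fcoe entry; the switch config in the original post lists class-fcoe as no-drop in both of its network-qos policy-maps.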

OK Amit, thanks again for all of your help! I finally heard from my switch guy, and it turns out that DCBX is not in fact enabled. The TLVs that I am seeing on my card are the "desired" settings, which I wish QLogic did a better job of making intuitive. I am actually very surprised that this machine is even able to see the LUNs or do any FCoE at all, but I've seen stranger things.

Once we get this straightened out, I am pretty confident at this point that we should have it working properly. Again, thank you for your help! I will update this thread for any others who may be reading it with the final resolution.

One thing that may be helpful to people in my situation: if you install the QConvergeConsole Command Line Tools, you can run the command "qaucli -pr nic -idcbx 1" (replace the 1 with the port you want to look at), and it will give you the DCBX negotiation result right from your card. Thanks to QLogic for pointing me towards this application, as I had been using the GUI tool and its presentation was a bit hard to make sense of.

Many thanks, Jessie, for the update. I appreciate that you are putting in every bit of detail for others to understand it fully.

Thank you once again, and keep us posted on the results and any new observations.

Cheers,

-amit singh

Well, unfortunately I'm still in the same boat. DCBX now has no negotiation failures, the switch and the card agree on the no-drop class for FCoE, and they agree everything else is lossy, but I am still pausing on all CoS as if standard flow control is somehow on or multicast traffic is set to no-drop (which I was assured it is not).

The policy I posted in my OP should be just fine. So I'm now sort of looking for abstract scenarios that could cause this behavior.

I did notice that the card is set for LACP (channel-group mode active). The card has two ports, connected to nexus 1 and nexus 2 respectively, and they are both mapped to the same FCoE VLAN, which I think is odd because usually you'd have two separate VLANs for A and B. I don't know if that could cause issues of this type, but that is probably what I am going to work on reviewing next.