Cisco Support Community

New Member

UCS 2.2(1d) leads to VMware network outages

Since the update to 2.2(1d) we have network timeouts / outages between vSphere VMs hosted on UCS B-Series servers and the rest of the world outside VMware / UCS. It first occurred with SLES 10 guests, but Windows 2008 and other guests are also affected. Any ideas or similar experiences?

21 REPLIES
New Member


1) What version did you update from?

2) What version of ESXi, and what version of fNIC/eNIC drivers are you running?

3) Are your blade firmware and CIMC versions standardized on 2.2(1d)?

New Member


I updated from 2.1(3a), and ESXi is 5.5 U1 with the NFS APD hotfix and all other fixes.

UCS Manager, FIs, chassis, and all blades are at the same firmware level, 2.2(1d).

fNIC driver information:

vmkload_mod -s fnic
vmkload_mod module information
 input file: /usr/lib/vmware/vmkmod/fnic
 Version: Version 1.5.0.45, Build: 1198611, Interface: 9.2 Built on: Jul 31 2013
 License: GPLv2
 Name-space: com.cisco.fnic#9.2.2.0
 Required name-spaces:
  com.vmware.libfcoe#9.2.2.0
  com.vmware.libfc#9.2.2.0
  com.vmware.driverAPI#9.2.2.0
  com.vmware.vmkapi#v2_2_0_0

 

eNIC driver information:

ethtool -i vmnic0
driver: enic
version: 2.1.2.38
firmware-version: 2.2(1d)
bus-info: 0000:08:00.0
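
For anyone comparing versions: the installed Cisco driver VIBs can also be cross-checked from the ESXi shell (the grep pattern is just illustrative; package names can differ by release):

esxcli software vib list | grep -i -e enic -e fnic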


New Member


That's pretty rough. I'd open a TAC case at this point; something is not right.

VIP Green


Is this the hotfix that you are referring to?

http://kb.vmware.com/kb/2076392

Intermittent NFS APDs on VMware ESXi 5.5 U1 (2076392)

Until very recently, VMware didn't have a fix!


New Member

Yes, that's the hotfix I meant. Specifically, KB2077360:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2077360

It's the one for ESXi 5.5 U1.

New Member


I've just upgraded to eNIC driver 2.1.2.42.

Let's see what happens...

New Member


It seems the lags / connection disruptions still exist.

Error counters and the like are all zero.

Sniffing the network traffic shows packets being sent to the VMs, but the VMs don't answer them. No aborts in the service layer or anything like that.

 

New Member


Have you found a resolution to this issue yet? We are experiencing much the same issue...

Thanks

New Member


No solution yet.

I currently have open service requests with both Cisco and VMware.

Cisco hasn't taken any action yet. VMware is analyzing my ESXi log bundles.

Mainly affected are SLES 10 and Windows 2008 (R2) VM guests.

Could you tell me what your issues are? Maybe it will help me get a clue about what's going on here. I'm still not sure whether it's a Cisco or a VMware problem, since I patched two things at the same time. I should have known better...

 

VIP Green


Are you using vSwitch, DVS, or N1k?

New Member


No, old-style VLAN-tagged host networking.

New Member


Did you happen to have any success on this yet?  We've been battling the same thing for some time and not getting anywhere with our cases.

New Member


VMware and Cisco have no solution yet.

I've found out that the MAC addresses of the affected VM guests flap between the two uplinks of the same(!) fabric interconnect. We see this on our upstream 6500-series core switches/routers. The flapping begins after about 50 seconds of inactivity, which is the default MAC-learning timeout with spanning tree / STP in play. If I run ping loops from the affected VMs to outside targets, the flapping doesn't occur and they stay reachable. It's 100% reproducible. It's clear that if the MAC shows up on the wrong (not pinned) uplink, traffic from the 6500 to the FI goes nowhere; that's normal end-host-mode behavior.

But I've configured pin groups / port pinning for those ESXi vNIC adapters in UCSM, so this behavior should be forbidden by config.

Yet it happens. Flapping between the pin ports of FI A and FI B I could imagine, but this is between the two uplinks of a _single_ FI.

Could you confirm that?

Maybe this would give Cisco a hint.
I'm strongly tempted to downgrade to my old 2.1(3a), but for now I'm not allowed to change anything in UCSM, per Cisco support's advice. Annoying!
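
In case it helps anyone reproduce this, here is roughly how we watch it happen (the MAC address is a placeholder; exact command syntax varies by IOS / UCS release):

On the upstream 6500, watch where the VM's MAC is learned:

show mac address-table address xxxx.xxxx.xxxx
(on older IOS: show mac-address-table address xxxx.xxxx.xxxx)

On the fabric interconnect, check which uplink each vNIC is pinned to (UCS NX-OS shell):

connect nxos a
show pinning server-interfaces
show pinning border-interfaces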


New Member


I'm sitting on the phone with Cisco at this very moment, and have confirmed the same thing.  It's being escalated to the business unit (the coders) as a Sev 1 to get a workaround and new code created to address this.  I'll post anything I hear.

 

We were also hit with this on our 7Ks, which is likely making the issue worse for us:

http://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/6_x/nx-os/deferral/Deferral-Notice-N7K-628.html

New Member


I've informed my Cisco support engineer about this thread.
Let's see what happens. I'd be pleased to hear from you if there is any news on this topic.

New Member


I'm curious.  Are you running this on B-Series with M81KR VICs and ESXi 5.5?  We're on B230 M2s.

Cisco has asked that I pull it out and replace it with a VIC 1280.  I thought we had some, but we don't.  What I did do was move a couple of the problem VMs to a B200 M3 with the VIC 1240, and the problem went away.  Moved them back to the B230s and the drops resurfaced immediately.

We'll be picking this up again at noon.  There is a newer version of the enic driver posted on VMware's site, 2.1.2.50.  I might put that on a blade and see.

New Member


I still have the enic .42 version running on my ESXi hosts.

I have these blades here, and I'll try all of them to confirm what you've seen:

B200 M2 - M81KR
B230 M2 - M81KR
B200 M3 - VIC 1240

Right now the troubled VMs are running on a B200 M3 with the VIC 1240 and we see the problem there, so I'll move them around a bit and report what I find.
My support engineer thinks the 6500 core switches/routers are at fault, not UCS; maybe the ASA firewall module, or VSS/hypervisor out-of-sync errors.

New Member


We ruled out the specific network card a few days ago.  However, we did achieve stability and are running in a workaround state.

Each FI is connected to both of our 7Ks (no vPC).  We disabled one of those two links on each FI.  Rock solid now.

So far Cisco hasn't been able to duplicate our issue.  We're just sitting and waiting now.

New Member


OK, I tested all three blade types with the two adapters. It's the same on all blades: after about 50 seconds, ping losses occur.
I'll have another phone call with Cisco today. VMware stated it's not their fault and closed the request for now.

New Member


Today I had a long phone call and WebEx with Cisco.
What we found out is the following:
The affected VMs (the ones whose MAC addresses show up on the wrong uplink interface of the upstream switch) are using multicast packets. I've configured the following port pinning in UCS for FI A and FI B:

1/25 - uplink for ESXi traffic (vMotion, HA, management console, etc.)
1/26 - uplink for all the VM guest machines

All uplinks carry the same VLANs of the provided trunk.
Multicast policy in UCS: IGMP snooping state enabled, querier state disabled (IMHO the defaults).

What we see with "show ip igmp snooping vlan <id>" is that UCS always seems to use the first uplink that can handle IGMP as the querier port, here 1/25. But the VMs that are also doing multicast are pinned to uplink 1/26, so when there isn't much other traffic and the VMs send multicast, the MAC is learned on the wrong port at the upstream switch. I manually disabled and re-enabled uplink 1/25 to force the querier to be detected on port 1/26, and the error was gone.

Could anyone check whether it's the same at your site?
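
If you want to check, this is roughly the sequence on the fabric interconnect (the VLAN ID is just an example; run it on each fabric):

connect nxos a
show ip igmp snooping vlan 100
show ip igmp snooping mrouter

Then compare the mrouter / querier port in that output with the uplink your VM vNICs are pinned to (show pinning server-interfaces).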

 

Current workarounds are:

1) Disable IGMP in UCS and in the VM guests

2) Disable one uplink / use only one uplink per FI

3) Disable all VM guest VLANs on the uplink for ESXi traffic and vice versa; in other words:
only enable the VLANs on each uplink that are absolutely needed

4) Swap the MAC-pinning configs between the uplink ports

5) Bundle the uplink ports into a port channel and pin to that instead (rough CLI sketch below)
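
For #5, a rough UCSM CLI sketch; I'm writing this from memory, so treat the port-channel ID and member ports as placeholders and check the UCSM CLI configuration guide for your release:

scope eth-uplink
scope fabric a
create port-channel 25
create member-port 1 25
create member-port 1 26
enable
commit-buffer

Afterwards the pin group / static pinning target would be the port channel instead of the individual uplink ports.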

 

Cisco says it's not a bug but expected behavior. I can't follow that, because if I a) do port pinning and b) want / need to use IGMP, then IMHO the multicast traffic should flow through the uplinks configured via port pinning, and not only through one (the first) of them. Could anybody with older firmware (2.1.x) send me the IGMP snooping querier results from the command mentioned above?

So for now I don't know which of these options would be the best one for me :-(
I'm leaning toward workaround #4. What now?

 

Cisco Employee


Hello,

I don't want this to be taken as a recommendation, but more as an informative post in response to your concerns about what the Cisco tech may have said regarding it being expected behavior. Just for reference, this would be in line with option #3 that you listed above.

 

Layer 2 Disjoint Upstream Packet Forwarding in End-Host Mode

Server links (vNICs on the blades) are associated with a single uplink port (which may also be a PortChannel). This process is called pinning, and the selected external interface is called a pinned uplink port. The process of pinning can be statically configured (when the vNIC is defined), or dynamically configured by the system.

In Cisco UCS Manager Release 2.0, VLAN membership on uplinks is taken into account during the dynamic pinning process. VLANs assigned to a vNIC are used to find a matching uplink. If all associated VLANs of a vNIC are not found on an uplink, pinning failure will occur.
The traffic forwarding behavior differs from that in Cisco UCS Manager Release 1.4 and earlier in the way that the incoming broadcast and multicast traffic is handled. A designated receiver is chosen for each VLAN, rather than globally as in Cisco UCS Manager Release 1.4 and earlier.
 
This is from the following link:
So as you can see, that behavior has been present since Release 2.0 onwards: there will only be one designated receiver per VLAN. This may be why the tech mentioned that as a viable workaround. In your case, it would be to ensure that the vNICs that need to multicast do so on a specific VLAN, and that that VLAN is only allowed on one uplink interface (or port channel). If the VLAN is enabled on multiple uplinks and the designated receiver is not the uplink the vNICs are pinned to, we may see that traffic be dropped.
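
As an illustration only: restricting a VLAN to specific uplinks is the disjoint-L2 style of configuration, done in the GUI via the LAN Uplinks Manager. From the UCSM CLI it looks roughly like the following, where the VLAN name, fabric, and slot/port numbers are placeholders and the exact syntax may differ by release:

scope eth-uplink
scope vlan <vlan-name>
create member-port a 1 26
commit-buffer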
 
Regards,
-Gabriel