06-17-2014 03:30 AM - edited 03-01-2019 11:42 AM
Since the update to 2.2(1d) we have had network timeouts/outages between vSphere VMs hosted on UCS B-Series servers and the rest of the world outside VMware/UCS. It first occurred with SLES10 guests, but Windows 2008 and others are also affected. Any ideas or similar experiences?
06-17-2014 08:04 AM
1) What version did you update from?
2) What version of ESXi, and what version of fNIC/eNIC drivers are you running?
3) Are your blade firmware and CIMC standardized on 2.2(1d)?
06-17-2014 10:39 PM
I updated from 2.1(3a), and ESXi is 5.5 U1 with the NFS APD hotfix and all other fixes.
UCS Manager, FIs, chassis and all blades are at the same firmware level, 2.2(1d).
fNIC driver information:
vmkload_mod -s fnic
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/fnic
Version: Version 1.5.0.45, Build: 1198611, Interface: 9.2 Built on: Jul 31 2013
License: GPLv2
Name-space: com.cisco.fnic#9.2.2.0
Required name-spaces:
com.vmware.libfcoe#9.2.2.0
com.vmware.libfc#9.2.2.0
com.vmware.driverAPI#9.2.2.0
com.vmware.vmkapi#v2_2_0_0
eNIC driver information:
ethtool -i vmnic0
driver: enic
version: 2.1.2.38
firmware-version: 2.2(1d)
bus-info: 0000:08:00.0
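When comparing several hosts, the `ethtool -i` output above can be checked with a short script. This is a minimal sketch, assuming the output has been captured as text; the reference version 2.1.2.42 is only the driver installed later in this thread and is not a known fix. The firmware string (e.g. 2.2(1d)) is not purely numeric, so only the driver version is compared.

```python
def parse_ethtool_info(text):
    """Turn `ethtool -i` style "key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like bus-info keep their colons
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Split a dotted driver version like '2.1.2.38' into integers for comparison."""
    return tuple(int(part) for part in version.split("."))


# Captured output from vmnic0, as posted above
SAMPLE = """driver: enic
version: 2.1.2.38
firmware-version: 2.2(1d)
bus-info: 0000:08:00.0"""

# Assumption: reference version taken from the later upgrade in this thread
MINIMUM = "2.1.2.42"

info = parse_ethtool_info(SAMPLE)
print(info["driver"], info["version"])  # enic 2.1.2.38
print(version_tuple(info["version"]) >= version_tuple(MINIMUM))  # False
```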
06-18-2014 08:58 AM
That's pretty rough. I'd open a TAC case at this point; something is not right.
06-18-2014 09:50 AM
Is this the hotfix you are referring to?
http://kb.vmware.com/kb/2076392
Intermittent NFS APDs on VMware ESXi 5.5 U1 (2076392)
Until very recently, VMware didn't have a fix!
06-18-2014 12:21 PM
I mean hotfix KB2077360:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2077360
It's the one for ESXi 5.5 U1.
06-17-2014 11:53 PM
I have just upgraded to eNIC driver 2.1.2.42.
Let's see what happens...
06-18-2014 08:53 AM
It seems that the lags/connection disruptions still exist.
Error counters and such are all zero.
Sniffing the network traffic shows packets being sent to the VMs, but the VMs don't answer them. There are no aborts in the service layer or similar.
06-23-2014 08:42 AM
Have you found a resolution to this issue yet? We are experiencing a similar issue...
Thanks
06-23-2014 12:38 PM
No solution yet.
Currently I have open service requests with both Cisco and VMware.
Cisco hasn't taken any action yet; VMware is analyzing my ESXi log bundles.
Mainly affected are SLES10 and Win2008 (R2) guests.
Could you describe your issues? Maybe that will give me a clue about what is going on here. I'm still not sure whether it's a Cisco or a VMware problem, since I patched two things at the same time. I should have known better...
06-23-2014 12:55 PM
Are you using vSwitch, DVS, or N1k?
06-23-2014 12:57 PM
No, old-style VLAN-tagged host networking.
07-07-2014 08:20 AM
Have you had any success with this yet? We've been battling the same thing for some time and are not getting anywhere with our cases.
07-07-2014 10:57 AM
VMware and Cisco have no solution yet.
I've found out that the MAC addresses of the affected VM guests flap between our two uplinks of the same(!) fabric interconnect. This is visible on our upstream 6500-series core switches/routers. Flapping begins after about 50 seconds of inactivity, which is the default timeout for MAC learning in spanning tree (STP). If I run ping loops from the affected VMs to outside targets, this doesn't occur and the VMs are reachable again. It's 100% reproducible. Obviously, if the MAC shows up on the wrong (not pinned) uplink, the traffic from the 6500 to the FI goes nowhere; that's normal end-host-mode behavior.
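The ping-loop workaround can be sketched as a small script run inside an affected guest. This is a minimal sketch, not a fix: the 30-second interval is an assumption chosen only to stay well under the ~50 seconds observed above, and the target address is hypothetical. It calls the OS's standard ping with a single echo per round.

```python
import platform
import subprocess
import time

# Assumption: ~50 s of silence triggers the flap, so ping well inside that window
INTERVAL_SECONDS = 30


def ping_command(target, system=None):
    """Return a single-echo ping command for the guest OS."""
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-n", "1", "-w", "2000", target]  # Windows: -w timeout in ms
    return ["ping", "-c", "1", "-W", "2", target]  # Linux iputils: -W timeout in s


def keepalive(target, rounds, interval=INTERVAL_SECONDS,
              run=subprocess.call, sleep=time.sleep):
    """Ping `target` `rounds` times, `interval` seconds apart; return exit codes.

    `run` and `sleep` are injectable so the loop can be exercised without
    sending real traffic or waiting.
    """
    codes = []
    for i in range(rounds):
        if i:
            sleep(interval)  # wait between pings, not before the first one
        codes.append(run(ping_command(target), stdout=subprocess.DEVNULL))
    return codes
```

For example, `keepalive("192.0.2.1", rounds=2880)` would keep a (hypothetical) outside target pinged for roughly 24 hours at 30-second intervals.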
However, I've set pin groups/port pinning for those ESXi vNIC adapters in UCSM, so this behavior should be prevented by configuration.
But it happens anyway. I'd expect any flapping to be between the pinned ports of FI A and FI B, but it's between my two uplinks of _one_ FI.
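The flapping check can be sketched as follows, for anyone trying to confirm the same pattern. This is a minimal sketch, assuming MAC-table observations from the upstream switch have already been collected as (timestamp, MAC, uplink) tuples; the collection method, the MAC addresses and the uplink names are hypothetical. It counts how often each MAC moves between uplinks.

```python
from collections import defaultdict


def find_flapping_macs(observations):
    """observations: iterable of (timestamp, mac, uplink) tuples, in any order.

    Returns {mac: number_of_moves} for MACs learned on more than one uplink.
    """
    last_uplink = {}
    moves = defaultdict(int)
    # Sort by timestamp so moves are counted in chronological order
    for _ts, mac, uplink in sorted(observations):
        previous = last_uplink.get(mac)
        if previous is not None and previous != uplink:
            moves[mac] += 1
        last_uplink[mac] = uplink
    return dict(moves)


# Hypothetical observations: timestamps in seconds, made-up MACs and uplink names
observations = [
    (0, "0050.56aa.0001", "uplink-1"),
    (60, "0050.56aa.0001", "uplink-2"),
    (120, "0050.56aa.0001", "uplink-1"),
    (0, "0050.56aa.0002", "uplink-1"),
    (120, "0050.56aa.0002", "uplink-1"),
]

print(find_flapping_macs(observations))  # {'0050.56aa.0001': 2}
```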
Could you confirm that?
Maybe this would give Cisco a hint.
I'm strongly tempted to downgrade to my old 2.1(3a) again, but on Cisco support's advice I'm not allowed to change anything in UCSM right now. Annoying!
07-07-2014 01:18 PM
I'm on the phone with Cisco at this very moment and have confirmed the same thing. It's being escalated to the business unit (the developers) as a Sev 1, for a workaround and new code to address this. I'll post anything I hear.
We also hit this issue on our 7Ks, which is likely making the problem worse for us:
http://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/6_x/nx-os/deferral/Deferral-Notice-N7K-628.html