Nexus switch issues with Intel 10GbE cards and bonding/teaming?
Hi, is anyone aware of any quirks or special configurations required when connecting Intel 10GbE NICs to Nexus switches? We've run into a number of problems. Have I missed anything obvious? Details below.
Setup no. 1: a Nexus 4900M, two Dell R510s with dual-port Intel 82599EB NICs (8086:151c), and two Dell R720xds with dual-port Intel X540-AT2 NICs (8086:1528). The servers run CentOS 6.4 and the 10GbE interfaces are bonded in mode 4 (802.3ad/LACP).
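For completeness, the bonding is the stock CentOS 6 setup, roughly like the sketch below (device names, addresses and the miimon value are illustrative, not copied from the actual hosts):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.10
NETMASK=255.255.255.0
BONDING_OPTS="mode=4 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth2 (eth3 is identical apart from DEVICE)
DEVICE=eth2
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes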
Problem no. 1: The R720xd boxes occasionally lose network connectivity. One of the interfaces in the bond goes down for 4 seconds, then recovers. This happens at random. Sometimes the other interface in the bond goes down as well before the first one has recovered, and that's when the machine drops off the network. The R510 boxes, however, do not exhibit this behaviour.
Setup no. 2: The Nexus 4900M has been replaced with a Nexus 3064T switch.
Problem no. 2: The R510s and R720xds have swapped roles. About a week after the switch was replaced, the R510 NICs started flapping at an exorbitant rate, going down several times every minute.
Workaround: I have unretired the 4900 and moved the R510 machines to it. Since then, no interface flapping has been observed. I.e., the R510s are on the 4900, and the R720xds on the 3064.
Yes, all Dell machines had their firmware and BIOS updated to the latest available through OMSA. I also applied Intel's preboot updates, admittedly not the latest (v19 vs. 19.3).
I now have hard evidence that the problem has nothing to do with teaming as such. We have other machines here with the Intel 82599EB card (ProLiant 360p G8), with only a single port in use, and they show exactly the same problem on the 3064. Yet another data point: another group of (custom-built) servers with Intel X540-AT2 NICs does not show the problem. This exactly mirrors the behaviour of the Dells in the original post.
I don't think this applies. We observe this problem across driver versions, ranging from 3.9.15-k (the latest in CentOS 6.4) through 3.17.3 to 3.18.7. The only correlation I have right now is the combination of NIC and switch.
I'm currently thinking along the lines of an "interesting"/buggy driver plus behavioural differences between the two switch models. Does anyone know whether the RX/TX flow-control settings on the switch must, or should, match the NIC settings? I found that the 3064 defaults to RX/TX flow control off on all interfaces, whereas the 4900 enables it on connected ports with link up (all the NICs have it enabled, but I don't know whether that is what prompts the 4900 to enable it). Nothing related to flow control has been explicitly configured on either switch.
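(For reference, this is how I've been comparing the two sides; the interface name is just an example.)

On the server: ethtool -a eth2 (shows the NIC's pause/autoneg settings)
On the switch: show interface flowcontrol (shows per-port send/receive flow-control state)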
While I still don't understand the root cause of this problem, nor the specifics of flow-control negotiation, some experimentation shows that I can work around it by either configuring all ports on the 3064 that link to 82599EB cards with
flowcontrol receive on
flowcontrol send on
or by using ethtool on the servers to turn the pause options (autoneg/rx/tx) off on those interfaces.
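For the record, the host-side variant boils down to something like this (interface name is an example, and the setting has to be reapplied after a reboot, e.g. via ETHTOOL_OPTS in the ifcfg file or a small script):

ethtool -A eth2 autoneg off rx off tx off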