So I am looking for thoughts on the following implementation.
I have three sets of Nexus5ks (PODS) that I want to setup the peer-keepalive links for.
For each POD I have configured the mgmt0 ports and connected them to an L2 switch. This L2 switch carries each POD's peer-keepalive along with some other management services for our DC. My concern is that all of the PODs' peer-keepalives traverse this single switch, and I want to make sure I fully understand what will happen if it goes down. We would work diligently to restore service to this switch, as other critical management services run on it, but having a single point of failure for three PODs' peer-keepalives has me concerned.
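For context, a typical peer-keepalive setup over mgmt0 looks something like the following. The domain ID and addresses are illustrative only, not taken from your environment:

```
! First N5K of a POD (vPC domain ID and IPs are examples)
vpc domain 10
  peer-keepalive destination 192.168.100.2 source 192.168.100.1 vrf management

! Second N5K of the same POD (mirror image)
vpc domain 10
  peer-keepalive destination 192.168.100.1 source 192.168.100.2 vrf management
```

Using `vrf management` keeps the keepalive on mgmt0 and off the peer link, which is Cisco's recommended practice.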
So if the keepalive link goes down, my understanding is that all the vPCs will remain active and data forwarding will continue. That's good to know. But are there any other risks or caveats I should be aware of? What if another system failure occurs while the keepalive link is down, e.g. a switch reboots or a vPC drops?
Also, is there any failure scenario where all three PODs would lose data forwarding if this L2 switch carrying all the keepalives fails?
I feel it would be overkill to set up a separate L2 switch for each POD just for this, so I am leveraging an existing L2 switch we use for other network management functions.
As you already know, once the vPCs are operational, if the peer-keepalive link fails then everything carries on as before: both switches continue to forward traffic on their vPC member ports.
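If you want to see this state for yourself during a keepalive outage, these standard NX-OS show commands report the keepalive status and overall vPC health independently:

```
! Run on either vPC peer
show vpc peer-keepalive
show vpc brief
```

You should see the peer-keepalive reported as down while the peer link and the individual vPCs remain up.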
If you were then unfortunate enough to suffer a failure of the vPC peer link while the peer-keepalive is down, you get into the scenario where the vPC member ports on the operational secondary device are taken down. You still have connectivity to downstream devices via the operational primary though, so unless you have single-attached devices on the secondary, you're still OK.
"What if another system failure occurs when this keepalive link is down? A switch reboots or a vPC drops?"
If one of the Nexus 5K switches reboots while the peer-keepalive is down, the remaining N5K will remain (or become) operational primary and continue to forward traffic on its vPC member ports. If you lost both Nexus 5Ks of the same POD while the Layer-2 switch was down, then depending on your code version and configuration you could run into issues when they came back up: in the early days of vPC the peer-keepalive was required to initially establish the vPC, but Cisco addressed this from 5.0(2)N2(1) with the auto-recovery feature.
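Assuming you're on 5.0(2)N2(1) or later, auto-recovery is enabled under the vPC domain. The domain ID is an example, and the reload-delay shown is the platform default:

```
vpc domain 10
  auto-recovery reload-delay 240
```

With this configured, if both peers reload and only one comes back (or the peer-keepalive never re-establishes), the surviving switch will assume the primary role and bring up its vPCs after the reload-delay timer expires, rather than holding them down indefinitely.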
If a vPC drops on one or both of the peers, e.g. due to a single link failure or the entire downstream device rebooting, then the ports and the vPC become operational again on both peer devices once the downstream device recovers. This is irrespective of the state of the peer-keepalive link.
The Virtual Port Channel Operations guide discusses these failure scenarios (and more besides) and the use of auto-recovery, and is worth a read to ensure you fully understand the recovery options for every scenario.
In short, I believe that what you're planning is an acceptable risk.