Question about where multicast needs to be enabled to get OTV EDs to form adjacencies
We have four 7Ks (two in Data Center 1 and two in Data Center 2) which are used to implement OTV in a multi-homed configuration. I've included a PDF which shows the L2 and L3 topology. When one of the links went down (depicted in the 2-page diagram as the link between DC1-7K2 and DC2-7K2), OTV didn't behave as expected. Some VLANs worked and some did not (it turns out only the VLANs for which DCx-7K1-OTV was AED worked). Upon further review, the EDs were not forming adjacencies properly. The only working adjacency was between the two EDs still able to reach each other across the working L3 link between DC1 and DC2. I suspect this has something to do with a misconfiguration, or a sub-optimal configuration, which is preventing multicast traffic from reaching all of the EDs (I should mention that I was not involved in the design or configuration of OTV). Unfortunately, my knowledge of multicast is weak, so I'm hoping to get some insight from the community.
One possibility which was brought to my attention is that a reverse path forwarding (RPF) check is causing packets to be dropped. Referencing the diagram, the L3 links between DC1-7K1 <-> DC1-7K2 and DC2-7K1 <-> DC2-7K2 were deliberately set to OSPF cost 1000 to make those links less preferred. Because of the link failure, traffic leaving the join interface on DC1-7K2-OTV doesn't go DC1-7K2-OTV -> DC1-7K2 -> DC1-7K1, etc. Instead, it goes DC1-7K2-OTV -> DC1-7K2 -> DC1-7K1-PROD -> DC1-7K1, etc. That doesn't seem like optimal routing. My thought is that once the traffic is in the core it should not go back down into distribution and then back up to the core; it should stay in the core. Whether or not that's actually causing the RPF check to fail is something I don't know for sure.
The other concern is that multicast is NOT enabled on the L3 interfaces which connect DCx-7Kx to DCx-7Kx-PROD/TEST, which means that when the traffic goes down to the distribution VDCs and back up to the core it traverses several L3 interfaces on which multicast is not enabled. Is that an issue?
One more question about OTV adjacencies: based on the topology, and assuming all of the configuration is valid, if a single L3 link between DC1 and DC2 fails, should I still see three OTV adjacencies on each of the EDs?
Re: Question about where multicast needs to be enabled to get OTV
I don't normally do this (especially on a subject I know so little about), but since the solution has to do with changes from one version of code to another, I think it's important to update this discussion so that others who may be affected can check their configuration and make changes as needed.
First, a little history. The design laid out in the topology diagram attached to the original post was produced on 5.0(x) NX-OS code by a third-party design engineer over 18 months ago. The design engineer deliberately configured multicast on the transport network so that DC1-7K1-OTV <-> DC2-7K1-OTV would form an OTV adjacency using DC1-7K1 <-> DC2-7K1, and DC1-7K2-OTV <-> DC2-7K2-OTV would form an adjacency using DC1-7K2 <-> DC2-7K2. The AED would be determined using the private site VLANs on each side. This design was tested and verified to work on 5.0(x) code. The design engineer ran the config by Cisco TAC and it was blessed by Cisco. As of 5.2(1), this is no longer valid: in a multi-homed configuration, site AND overlay adjacencies are mandatory. This is described in this whitepaper: http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DCI/whitepaper/DCI_1.html. From the document:
"The dual site adjacency state (and not simply the Site Adjacency established on the site VLAN) is now used to determine the Authoritative Edge Device role for each extended data VLAN. Each OTV edge device can now proactively inform their neighbors in a local site about their capability to become Authoritative Edge Device (AED) and its forwarding readiness. In other words, if something happens on an OTV device that prevents it from performing its LAN extension functionalities, it can now inform its neighbor about this and let itself excluded from the AED election process."
We upgraded to a version of 6.1(x) about 4 months ago and did not test OTV thoroughly enough at that time to see the effects of this change, so we had been running on a broken configuration ever since the upgrade. In order for OTV to work seamlessly during a link failure (at least with our topology), multicast must be configured so that all EDs can form adjacencies with all other EDs. Otherwise, the AEDs will not relinquish authority over their VLANs, and traffic will stop flowing for VLANs which are authoritative on the overlay affected by the link failure. If you issue "show otv adjacency" at the CLI on any of the EDs, you should see adjacencies for all of the other EDs. If you issue "show otv site detail" and the "Dual Adjacency State" is anything but "FULL", something is broken and needs to be corrected.
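As a quick checklist, the two verification commands look like this on each ED (hostnames here are from our diagram; exact output formatting varies by NX-OS release):

```
DC1-7K1-OTV# show otv adjacency      ! expect an entry for every other ED (three, in our four-ED topology)
DC1-7K1-OTV# show otv site detail    ! "Dual Adjacency State" must be FULL for each neighbor
```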
Regarding the suggestion that an RPF check was the culprit (as mentioned in the OP), that was not the case, and a simple CLI command confirmed it: "show ip mroute summary rpf-failed" will tell you how many multicast packets were discarded by the switch due to RPF check failures. It was 0 across the board. The root cause of our issue was that a shortest-path tree (SPT) did not exist to allow all sources/receivers to communicate across the transport network. I found that the most effective way to determine this was to issue "show ip mroute" for each ED (done on the transport network) and map out the incoming and outgoing interfaces. It became clear that the path included interfaces which were either not enabled for multicast or led to devices which were not configured to support multicast routing.
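To make the method concrete, this is roughly how I walked the tree on each transport VDC (the group address below is a placeholder; substitute your overlay's control group):

```
! Confirm RPF is not the problem
show ip mroute summary rpf-failed

! Inspect the incoming/outgoing interface lists for the control group
show ip mroute 239.1.1.1

! Any L3 interface that can end up in the multicast path (including the
! links down to the PROD/TEST VDCs that traffic falls back to during a
! failure) needs PIM enabled, e.g.:
interface Ethernet2/1
  ip pim sparse-mode
```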