I have a question about an unusual split-brain situation with two VSMs that we saw.
We have a well-functioning VM datacenter with VSM-A (active) on host ESX-1 and VSM-B (standby) on host ESX-2. We use version 4.0.4.SV1.3b for the VSMs and VEMs.
Then host ESX-2 developed problems with the network interfaces that carry the VSM system VLANs, so the heartbeat connection between the VSMs was broken. In this situation VSM-B changed its state from standby to active, but it was not reachable via IP (I could reach it only through the console in vCenter). On VSM-B I could not see any hosts or a standby VSM (via show module). VSM-A was also still active, and there I could see both hosts but no standby VSM. All of this was fine so far.
Now the problem:
When the network connection for host ESX-2 came back, both active VSMs started seeing each other. At that moment VSM-A was automatically rebooted and VSM-B stayed active. After the reboot VSM-A was standby.
That seems strange to me, because VSM-B was the standby VSM before the problems with host ESX-2 occurred, and VSM-B did not have a good configuration during the split-brain situation (it didn't know any hosts). My expectation was that the VSM that was active before the split-brain situation happened would win the active/standby election.
So my question:
When two active VSMs start seeing each other, which VSM will be the active one? What is the trigger for this?
When two active VSMs in the same domain resume contact with each other, the primary VSM will always reload. This is a static decision that does not depend on which VSM was active before the active-active situation occurred.
In a system with two VSMs, the redundancy role of one of them is always configured as "primary" and the other as "secondary" during the initial setup (see the "show system redundancy state" output). The terms primary and secondary are equivalent to "module 1" and "module 2" in the virtual chassis, and should not be confused with "active" and "standby", which are the redundancy states. When the system boots up, the primary VSM normally becomes the active and the secondary VSM becomes the standby. However, depending on the number of system switchovers after that, the active VSM could be either the primary or the secondary at any point in time.
Redundancy role is equivalent to module number and does not change after initial setup unless the user explicitly changes it. Redundancy state tells you which VSM is controlling the switch and which one is backing it up, and it may change depending on operational conditions.
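To make the rule concrete, here is a toy Python model of the decision described above. It is only an illustration, not Cisco code: the `Vsm` class and the `resolve_active_active` function are made-up names, and the example assumes VSM-A holds the primary role, which is what its reload in the original post implies.

```python
from dataclasses import dataclass


@dataclass
class Vsm:
    name: str
    role: str   # "primary" or "secondary"; fixed at initial setup
    state: str  # "active" or "standby"; changes with operational conditions


def resolve_active_active(a: Vsm, b: Vsm) -> Vsm:
    """Model two active VSMs regaining contact.

    The primary always reloads and rejoins as standby, regardless of
    which VSM was active before the split. Returns the VSM that stays active.
    """
    if a.state != "active" or b.state != "active":
        raise ValueError("both VSMs must be active for an active-active conflict")
    if {a.role, b.role} != {"primary", "secondary"}:
        raise ValueError("expected exactly one primary and one secondary")

    # Static decision: only the role matters, not the history.
    loser, winner = (a, b) if a.role == "primary" else (b, a)
    loser.state = "standby"  # state after the reload completes
    return winner


# Hypothetical reconstruction of the scenario in this thread.
vsm_a = Vsm("VSM-A", role="primary", state="active")
vsm_b = Vsm("VSM-B", role="secondary", state="active")
survivor = resolve_active_active(vsm_a, vsm_b)
print(survivor.name, vsm_a.state)
```

Note that the function never looks at which VSM was active before the split, which is exactly why VSM-B stayed active in the case above.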
Thanks Juan, that answers my questions completely.
But I don't think it is good behaviour for the primary VSM to always reload after a split-brain situation.
When the ESX interface carrying the system VLANs for the secondary VSM is flapping, it causes a split-brain situation, and a few seconds later, when the VSMs see each other again, the well-functioning primary VSM reloads. A few seconds after that, the secondary VSM is isolated again because of the NIC flapping, and the primary VSM comes back and is active. And a few seconds later the game starts again when the VSMs see each other.
In this situation (with a flapping NIC at the secondary VSM) I would have no working VSM, because the primary VSM is permanently reloading and the secondary VSM disappears every 5 seconds.
And this is not an unlikely example; we had this situation this week.
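The flapping scenario can be sketched as a small Python simulation. This is a simplified model under stated assumptions, not actual VSM behaviour: it ignores all timers, assumes each VSM goes active as soon as the heartbeat is lost, and applies the "primary always reloads" rule from the answer above. The function name `count_primary_reloads` and the event strings are hypothetical.

```python
def count_primary_reloads(link_events):
    """Count primary VSM reloads for a sequence of link events.

    link_events: iterable of "down" / "up" strings describing the link
    that carries the heartbeat to the secondary VSM's host.
    """
    primary, secondary = "active", "standby"
    reloads = 0
    for event in link_events:
        if event == "down":
            # Heartbeat lost: each VSM assumes its peer is gone and goes active.
            primary, secondary = "active", "active"
        elif event == "up":
            # Contact resumes: if both are active, the primary reloads
            # and comes back as standby.
            if primary == "active" and secondary == "active":
                reloads += 1
                primary = "standby"
        else:
            raise ValueError(f"unknown link event: {event!r}")
    return reloads


# Three flaps of the secondary's NIC lead to three reloads of the primary.
print(count_primary_reloads(["down", "up"] * 3))
```

In this model every flap costs one reload of the otherwise healthy primary, which matches the loop described in the post.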