Fail Over and Redundancy with UCCE 7.5

Unanswered Question
Jul 8th, 2010

I have a customer that is installing UCCE, and they want to run side A and side B in standalone if the visible and private networks are both down. The SRND states that the system looks at the PG with the most active connections and takes over, and the other side goes dark. I am designing this in a distributed model with agents at both sites. Any ideas other than Parent/Child?


... the system looks at the PG with the most active connections and takes over and the other side goes dark.


Not quite. Behaviour of a duplex Router pair when the private network breaks is a complex affair.


As you probably know, the MDS pairs form a "synchronized zone" - one MDS will be PAIRED-ENABLED and the other PAIRED-DISABLED.


Consider all the PGs out there. On some PGs, the active link of the pgagent will be connected to the ccagent on the enabled side, while on the remainder of the PGs, the pgagent active link will be connected to the disabled side.


When a pgagent has an active link to the disabled side, that MDS cannot set the message order - it has to send the message to its peer MDS (enabled), who sets the message order, and now both Routers get the message in the same order at the same time.


Therefore, when the private network breaks, any PGs that have the active link connected to the disabled side will realign to the enabled side. The idle side remains connected - it's just a state change.


Idle paths and active paths both count for device majority.


The rules for the enabled side are simple: if it has device majority, it goes straight to ISOLATED-ENABLED. If it doesn't, it goes to ISOLATED-DISABLED.


The disabled side is more complex. First it checks for device majority. If it has device majority, it initiates the TOS (test other side) process. If every PG it can communicate with reports that it has no communication to the other side, then it will promote itself to ISOLATED-ENABLED.
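
The rules above for the two sides can be sketched as a small decision function. This is only an illustration of the logic as described in this post, not ICM code: the names `SyncState`, `has_device_majority` and `state_after_private_break` are made up for the example, "majority" is assumed to mean strictly more than half of the configured PGs, and the case where TOS finds a PG that can still reach the other side is assumed to leave this side disabled (the post does not spell that out).

```python
from enum import Enum


class SyncState(Enum):
    PAIRED_ENABLED = "PAIRED-ENABLED"
    PAIRED_DISABLED = "PAIRED-DISABLED"
    ISOLATED_ENABLED = "ISOLATED-ENABLED"
    ISOLATED_DISABLED = "ISOLATED-DISABLED"


def has_device_majority(reachable_pgs: int, total_pgs: int) -> bool:
    # Assumption: majority means strictly more than half of all configured PGs.
    # Idle and active paths both count as "reachable".
    return 2 * reachable_pgs > total_pgs


def state_after_private_break(
    prior: SyncState,
    reachable_pgs: int,
    total_pgs: int,
    pgs_reaching_other_side: int,
) -> SyncState:
    """State a side moves to when the private network breaks (hypothetical sketch)."""
    if prior not in (SyncState.PAIRED_ENABLED, SyncState.PAIRED_DISABLED):
        raise ValueError("expected a PAIRED-* state before the private link broke")

    # Either side without device majority goes (or stays) disabled.
    if not has_device_majority(reachable_pgs, total_pgs):
        return SyncState.ISOLATED_DISABLED

    # Enabled side with majority: straight to ISOLATED-ENABLED.
    if prior is SyncState.PAIRED_ENABLED:
        return SyncState.ISOLATED_ENABLED

    # Disabled side with majority: run TOS (test other side).
    # Promote only if no reachable PG can still talk to the other side.
    if pgs_reaching_other_side == 0:
        return SyncState.ISOLATED_ENABLED

    # Assumption: if any PG still sees the other side, stay disabled.
    return SyncState.ISOLATED_DISABLED
```

For example, `state_after_private_break(SyncState.PAIRED_DISABLED, 2, 3, 0)` returns `SyncState.ISOLATED_ENABLED`, while the same call with one PG still reaching the other side returns `SyncState.ISOLATED_DISABLED`.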


If the private network breaks and the public network is affected such that neither side has device majority, they both go disabled. Assuming the private link stays down, but the public network starts to come back in stages, eventually the majority of the PGs will be able to talk to one of the disabled sides, and then that will initiate the TOS process, and will go enabled.


Now let's consider what you have - you say "agents at both sites".


Let's imagine for a moment you have a 3rd site and a 4th site that have no agents - they are just for the central controller. You have a dedicated link between sites 3 and 4 for the private network, and a public network out to sites 1 and 2.


At sites 1 and 2, you have a Call Manager cluster, pair of PGs etc.


If the private network goes down, one of the sides will run simplex until the network is restored. Routing at sites 1 and 2 is unaffected.



If the public network to site 1 is down, routing at site 1 is broken until the network is restored. Site 2 is unaffected.




If the public network to site 2 is down, routing at site 2 is broken until the network is restored. Site 1 is unaffected.


If both networks are down, the whole system is isolated, and no routing occurs until the visible network has come back to the point where one of the sides will come up as ISOLATED-ENABLED.


Now what happens when we colocate the central controllers at the agent sites, as in your model? Have we improved the situation? On the surface it looks like we have - and that's what your customer is saying with "they want to run side A and side B in standalone if the visible and private networks are both down".


When the private link breaks and the public link breaks, each router is ISOLATED-DISABLED and cannot come up because it only sees 1 of 2 PGs (the ones on the LAN at the site). So now you are down on both sites.


You might address this by installing at site 1 a third PG, configured in the normal way (it doesn't do anything) talking to both Call Routers, one local, one remote. It can be simplex.


Now when the private link breaks and the public link breaks, site 1 can see the majority of the PGs so it comes up in ISOLATED-ENABLED. Routing resumes at site 1, but site 2 remains off the air. This is the best result you can achieve.
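
The arithmetic behind the two-PG and three-PG cases can be checked with a few lines. This is a rough sketch only: the PG names are placeholders, each site's PG is counted as one device as in the description above, and "majority" is again assumed to mean strictly more than half.

```python
def has_device_majority(reachable: set, configured: set) -> bool:
    # Assumption: strict majority of configured PGs must be reachable.
    return 2 * len(reachable & configured) > len(configured)


# Original design: one PG per site, both WAN links down.
two_pgs = {"PG1", "PG2"}
site1_two = has_device_majority({"PG1"}, two_pgs)   # site 1 sees 1 of 2
site2_two = has_device_majority({"PG2"}, two_pgs)   # site 2 sees 1 of 2

# With a third (otherwise idle) PG added at site 1.
three_pgs = {"PG1", "PG2", "PG3"}
site1_three = has_device_majority({"PG1", "PG3"}, three_pgs)  # 2 of 3
site2_three = has_device_majority({"PG2"}, three_pgs)         # 1 of 3

print(site1_two, site2_two)      # False False -> both sides stay disabled
print(site1_three, site2_three)  # True False  -> only site 1 can go enabled
```

Only one site can ever hold a strict majority, which is why site 2 stays dark in the three-PG design and why both sides cannot run simplex at once.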


The most important thing to think about is this: when the private network comes back up, the synchronizers try to do a state transfer. Assuming success, the synchronizers change to PAIRED mode. Now the routers and loggers will exchange state. If each site had been working in simplex mode ("split brain"), then when they come together you will have a totally messed up database. This corrupted state will most likely be unrecoverable.


It has happened in the past. I'll spare you the gory details.

