Split Brain Central Controller

Unanswered Question
Mar 5th, 2009

I need to know what scenario would cause a split-brain Central Controller. My Central Controller is split across two different sites, and we recently had a failure of both the Public and Private WAN. Each site still had local connectivity (Rogger A could communicate with PG A and Rogger B could communicate with PG B), but any synchronization that would have taken place between the duplexed pairs would have failed. There are currently three PGs configured in the Router configuration. I am assuming that if the Roggers could not communicate, they would both assume that the other side was down and try to take control. Finally, would the system automatically recover from a split-brain Central Controller, or would manual steps be required? Thanks


John - a split brain scenario would normally occur if you have an even number of PGs checked in setup. When one router goes down, the other doesn't take over.

RouterA can run in simplex if half the PGs are available. RouterB won't run in simplex because half is only considered a majority for the A side, not the B side. So when a router goes down, it tests the other side, and each side verifies whether it's connected to the majority of the PGs. If the B side can't connect to a majority+1, it won't go active.

Based on your configuration, I would think you're okay with 3 PGs configured.

I think there's a pdf on this somewhere, if I find it I'll post it.

Robb

john.miccio Fri, 03/06/2009 - 08:05

Rob,

Thanks for the quick reply. I think I read something similar to this in the SRND, but if you are able to find a separate document on this question specifically, I would appreciate any help I can get. I must have had the wrong understanding of Split Brain. I was under the assumption that Split Brain meant that both sides thought they were in control because they could not communicate with their duplexed partner. Does that mean that both sides of the Router are not Active when split brain occurs?

Edward Umansky Fri, 03/06/2009 - 11:13

John, you're correct: split brain refers to both sides being active at once. ICM tries to avoid this with the scheme that Rob described: side A will go active only if it can communicate with at least half of the configured PGs, while side B will only go active if it can communicate with a majority (more than half). If you have PGs which are split across the WAN, then it is possible for both Routers to think they are communicating with a majority of PGs, since the PGs themselves can become split brain. I imagine that's what is happening in your scenario. There are different ways to deal with this, for example putting a simplex "dummy" PG only on side A or at a third site. It depends on what sites you have available and what type of failover behavior you are looking for.
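To make the rule concrete, here is a minimal Python sketch of the activation test as described in this thread. It is a simplified model, not ICM's actual logic: the function name `may_go_active` and its parameters are made up for illustration, and it only counts how many PGs a side believes it can reach.

```python
def may_go_active(side: str, reachable_pgs: int, configured_pgs: int) -> bool:
    """Simplified model of the rule described in this thread (hypothetical helper)."""
    if side == "A":
        # Side A may go active with at least half of the configured PGs.
        return 2 * reachable_pgs >= configured_pgs
    # Side B needs a strict majority (more than half).
    return 2 * reachable_pgs > configured_pgs


# Four PGs, cleanly split two and two: only side A may go active.
print(may_go_active("A", 2, 4))  # True
print(may_go_active("B", 2, 4))  # False

# Three PGs, but the PGs are themselves split so each side believes it
# reaches two of them: both sides pass the test, which is split brain.
print(may_go_active("A", 2, 3))  # True
print(may_go_active("B", 2, 3))  # True
```

The last two lines show why an odd PG count alone is not enough if the PGs are duplexed across the same WAN that failed.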

cvenour Sun, 03/08/2009 - 18:28

The only time I have seen a split brain happen was when certain parts of the customer's network were failing intermittently. They had 3 call centre sites - A, B & C - with the side A CC at Site A and the side B CC at a data centre site - Site D.

The customer had 9 PGs - 3 agent, 2 IPIVR, 3 MR PGs - with one of each at each call centre site.

An intermittent failure of a fibre port connector was occurring at Site A's connection to Site D, which resulted in the network toggling between a primary and a backup route every few seconds. This resulted in a break in the CC traffic, but left each CC able to talk to a majority of PGs (due to the WAN architecture).

This caused both CC sides to declare themselves the primary / active side, which caused the PGs to receive different route results (because of a lack of CC sync). This was ugly.
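A rough way to check a WAN design for this failure mode is to write down which PGs each CC side could still reach if the CC-to-CC path broke, then apply the half / more-than-half rule from the earlier posts. The Python sketch below does that. The PG names and the reachability sets are invented for illustration and are not the customer's real topology, and `active_sides` is a hypothetical helper, not an ICM tool.

```python
def active_sides(reach_a: set, reach_b: set, configured: set) -> list:
    """Return the CC sides that would go active with the CC-to-CC link down.

    Simplified model: side A needs at least half of the configured PGs,
    side B needs more than half.
    """
    total = len(configured)
    sides = []
    if 2 * len(reach_a & configured) >= total:
        sides.append("A")
    if 2 * len(reach_b & configured) > total:
        sides.append("B")
    return sides


# Nine configured PGs (hypothetical names).
configured = {f"PG{i}" for i in range(1, 10)}

# Assumed reachability while the link between the CC sides is broken.
reach_a = {"PG1", "PG2", "PG3", "PG4", "PG5", "PG6"}
reach_b = {"PG4", "PG5", "PG6", "PG7", "PG8", "PG9"}

sides = active_sides(reach_a, reach_b, configured)
print(sides)                # ['A', 'B']
print(sides == ["A", "B"])  # True: both sides active, i.e. split brain
```

If any plausible link failure yields both sides in the result, the routing should be changed so that the CC traffic and the PG traffic cannot fail independently in that way.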

The customer rectified this by changing the routes, and moving the CCs to different links.

The recommendations listed in the above posts cover off what you need to do to minimise the possibility of a split brain issue occurring. But even if you get this 100% correct, make sure the WAN routing is such that you can't end up with the scenario I've mentioned above.

C.
