
Unity Connection Split-brain Recovery

spizzo
Level 1

I have a situation where a Unity Connection cluster has been deployed over a WAN, and there have been connectivity issues between the sites. When this happens, both servers become primary, and when the connection is restored the cluster goes into split-brain recovery and does not answer calls. The problem is that this can happen during the day, and most calls are routed to Unity Connection first, so during split-brain recovery incoming calls get busy signals. Does anyone have a good workaround for this behavior? I would rather the cluster be out of sync during the day than not answer calls. I've been disconnecting the Unity Connection subscriber from the network to stop split-brain recovery, and reconnecting it later when an outage is acceptable.


Saurabh Agnihotri
Cisco Employee

Hi Steven,

What you are experiencing seems to be an expected behaviour. Please refer to the following from Cluster Configuration and Administration Guide for Cisco Unity Connection:

Effects on Calls in Progress When Server Status Changes

When the status of a Cisco Unity Connection server changes, the effects on calls in progress depend on the final status of the server that is handling a call and on the condition of the network. Table 3-3 describes the effects.

Table 3-3 Effects on Calls in Progress When Server Status Changes

Status Change: Primary to Secondary
Effects: When the status change is initiated manually, calls in progress are not affected. When the status change is automatic, effects on calls in progress depend on the critical service that stopped.

Status Change: Secondary to Primary
Effects: When the status change is initiated manually, calls in progress are not affected. When the status change is automatic, effects on calls in progress depend on the critical service that stopped.

Status Change: Secondary to Deactivated
Effects: Calls in progress are dropped. To prevent dropped calls, on the Cluster Management page in Cisco Unity Connection Serviceability, click Stop Taking Calls for the server, wait until all calls have ended, and deactivate the server.

Status Change: Primary or Secondary to Replicating Data
Effects: Calls in progress are not affected.

Status Change: Primary or Secondary to Split Brain Recovery
Effects: Calls in progress are not affected.

If network connections are lost, then calls in progress may be dropped, depending on the nature of the network problem.

Anyways, can you also provide the following information:

1) Unity Connection version (complete build)

2) Link speed between the Pub and the Sub

Have you tried checking the status of the "Connection Conversation Manager" Service when the callers hear fast busy?

Can you see the call arriving on one of the CUC servers using Port Status monitor when you hear fast busy?

Regards,

Saurabh Agnihotri

Thanks for your response Saurabh. I do understand this is expected behavior, but I'm looking for a workaround, if one exists. We're working to get a redundant WAN connection in place, but there will still be single points of failure in the network. I hate the idea that a WAN link going down for a short time means voice mail will not answer calls for 15+ minutes. It seems like this is an oversight for WAN deployments. In my experience, WAN links are the most common thing to fail in a network. The only way to avoid this behavior seems to be (very expensive) redundancy. Would you agree?

Unity Connection Version: 8.5.10000-206

WAN Speed: 100Mb/s

Has anyone else found this to be a challenge in their environment?

Going into SBR should not cause calls to drop.  While SBR is running, new messages left are not delivered until SBR is complete (because the MTA process which delivers messages is stopped as part of the SBR process). 

If calls are being dropped, there is most likely something going on between each individual Connection server and CUCM (assuming that's what it's integrated with).  SBR does not stop any service required to accept a new message.

I wish it did work that way, but here is Cisco's description of Split-brain mode:

Split Brain Recovery

After detecting two servers with Primary status:

• Assigns Primary status to the publisher server.

• Updates the database and message store on the server that is determined to have Primary status.

• Replicates data to the other server.

• Temporarily does not answer phone calls or take messages.

• Temporarily does not synchronize voice messages in Connection and Exchange mailboxes if single inbox is turned on in Connection 8.5 and later.

• Temporarily does not connect with clients such as email applications and the web tools available through the Cisco PCA.

Note: This status lasts only a few minutes, after which the previous status resumes for the server.

My issue isn't with current calls, or message delivery, it's with UC not answering phone calls during this time frame, which lasts 15+ minutes. This means that my call handlers which process calls are unavailable. There seems to be a lot of confusion about this. TAC engineers are often unaware of this.
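
To make the documented behavior concrete, here is a minimal Python sketch of the state logic described in the quote above: split brain is the condition where more than one server reports Primary, and recovery leaves the publisher as Primary. The `Node` class and both function names are hypothetical and for illustration only, and assigning Secondary to the subscriber after recovery is an assumption inferred from the quote; none of this is Cisco's implementation.

```python
from dataclasses import dataclass

PRIMARY = "Primary"
SECONDARY = "Secondary"


@dataclass
class Node:
    """Hypothetical model of one Unity Connection server in a cluster."""
    name: str
    is_publisher: bool
    status: str


def is_split_brain(nodes):
    """Return True when more than one node reports Primary status."""
    return sum(1 for node in nodes if node.status == PRIMARY) > 1


def resolve_split_brain(nodes):
    """Return the node statuses expected after split-brain recovery.

    Per the quoted guide, the publisher is assigned Primary. Giving the
    subscriber Secondary is an assumption made for this sketch.
    """
    if not is_split_brain(nodes):
        # Nothing to resolve: report the statuses unchanged.
        return {node.name: node.status for node in nodes}
    return {
        node.name: PRIMARY if node.is_publisher else SECONDARY
        for node in nodes
    }
```

For example, two nodes that both report Primary after a WAN outage would be flagged by `is_split_brain`, and `resolve_split_brain` would return the publisher as Primary and the subscriber as Secondary.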

You should have an issue with that. It's simply NOT true that the system (either publisher or subscriber) won't accept calls during SBR by design (unless there is something else going on). I've spent a lot of time working with failover/recovery since the redundancy functionality was first released (see https://supportforums.cisco.com/docs/DOC-5963, which was based on 7.0 but is still for the most part correct).

I would focus on which CUCM servers each individual Connection server is communicating with. Maybe the server is trying to re-register to another CUCM server or something. There's got to be something happening (I'm not doubting that it could be the result of SBR, just the notion that it's expected behavior).

Thanks Markus, you obviously have a lot of experience with UC clustering.

I'll try to gather some more information when I force SBR this evening. What I see is that during SBR, UC unregisters from CUCM for quite a while. I'm not quite sure how a misconfiguration could prevent it from registering its ports specifically during SBR. After SBR it registers its ports and answers calls. The UC cluster itself fails over from one CUCM to another just fine. It could be specific to my version, I suppose, but I've seen this happen at other clients as well, always on 8.x.

There's no doubt that the Cisco documentation says it temporarily stops answering calls. TAC engineers I've discussed this with usually start with your opinion that it should continue taking calls, but after researching, come back and tell me this is working as designed. My team has had multiple TAC engineers tell us this.

Markus, what would you say about the quote I pulled from the Cluster Configuration and Administration Guide for Unity Connection 8.x? Do you think it temporarily doesn't answer calls for a very short period of time? Do you think it's wrong?

The documentation is inaccurate or incomplete. I wouldn't swear that there couldn't be some anomalous behavior right around the time a state transition occurs (on the order of milliseconds), but not where it won't take calls for the entire SBR period. The CUCM<->Connection link/registration should not go down as a part of SBR. Obviously if there is a real network failure/recovery, it's possible that this link to the phone system could be affected, but SBR shouldn't have anything to do with it.

Steven,

It sounds as if the WAN connectivity issues are potentially a problem in general within your environment.  At any rate, could you provide some more details about your UC environment?  Specifically, the type of info I'd be interested in is:

CUCM cluster info - where are the Subscribers in relation to the Unity Connection servers?

Which CUCM subscribers does each Unity Connection server register to (and at what relative priority)?

How do you have CUCM configured to route calls to Unity Connection (i.e., Hunt List info, Line Group info, etc.)?

Hailey

Thanks for your reply David, but I don't think it's necessary to go there. I have some interesting information from some testing I did last night.

I have another client running UC 8.5.1SU3 who previously had the same problem when they were on 8.5.1.10000-xxx. We tested cluster failover, and calls were answered during SBR with SU3. I then went back to the client that is experiencing the problem and looked at the differences during SBR there. The issue is that the Connection Conversation Manager service stops for the duration of SBR on both servers. I'm going to open a case with TAC, because I don't see a bug report for anything like this.

Having a client that experienced the problem on 8.5.1.10000 but not SU3 (with the same UC/CUCM configuration) makes me think this is bug related. We are running a shipping version of UC on the servers having the problem, and that's never recommended.

I'll update this discussion with my findings. The fact that the Cisco docs lined up with what I experienced made me second guess how it was supposed to work, but I'm convinced UC should answer calls during SBR. Thanks for all your advice.

There are some SBR-related bugs in the 8.5.1 and 8.5.1(SU3) code that I came across as well. It may be worthwhile to search for SBR in the Bug Toolkit for those versions to see if anything matches what you're seeing. From a code perspective, I would definitely recommend running SU3 at this point if you can.

So I opened a TAC case and they are recommending an upgrade to SU3. I realized 8.5.1.10000-xxx is actually the base release, so I'm not even at SU1. I couldn't find any bugs specific to UC not answering calls, but I am also running into the bug where it takes a very long time for SBR to complete. After I upgrade to SU3 I'll let everyone here know the result.

asif khurram
Level 1

I had the same issue with Unity Connection 8.6.2ES81.24087-81: both servers in the cluster (connected over a WAN) were showing Primary, and Cluster Management was showing split brain. I tested the WAN connection between the two and it was fine. I tried restarting the pub and sub with no luck. Finally, I shut down the publisher for a while; when I powered it back on and the server was restored, the split-brain error was gone and both servers were behaving normally.

Here are some important commands you can issue on both nodes:
show cuc cluster status
utils dbreplication runtimestate
utils diagnose module validate_network
utils ntp status

Then run the "utils cuc cluster activate" command and monitor the cluster.
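
If it helps to keep track of what has been run where, the diagnostic commands listed above can be turned into a per-node checklist. This is only a rough Python sketch: the command strings are copied from this post, while `build_checklist` and the node names `cuc-pub` and `cuc-sub` are hypothetical. It does not connect to the servers; it only produces the list of (node, command) pairs to work through on each node's CLI.

```python
# Diagnostic commands quoted in this thread, to be run on both nodes.
DIAGNOSTIC_COMMANDS = (
    "show cuc cluster status",
    "utils dbreplication runtimestate",
    "utils diagnose module validate_network",
    "utils ntp status",
)


def build_checklist(nodes):
    """Pair every node with every diagnostic command, node by node."""
    return [(node, command) for node in nodes for command in DIAGNOSTIC_COMMANDS]


if __name__ == "__main__":
    # Hypothetical node names; substitute your own publisher and subscriber.
    for node, command in build_checklist(["cuc-pub", "cuc-sub"]):
        print(f"[{node}] {command}")
```

Grouping by node first keeps all four checks for one server together, which makes it easier to compare the publisher's results against the subscriber's.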
