Cisco Unity Connection cluster issue

Nov 24th, 2010

Hello,

I have a Cisco Unity Connection cluster of two servers, one publisher and one subscriber.  They are running software version 7.1.3ES43.33034-43.  Yesterday at about 8:43 I lost connection to the subscriber server.  I could still ping it but could not reach it via the GUI or SSH.  I spoke to TAC yesterday and they told me to simply reboot the server.  I was a little concerned about this because I don't want to cause some kind of split-brain effect.  I work in a hospital and it is imperative that the voicemail system stay up.  My only other option would be to do this at 4:00 am on a Sunday.  Any suggestions?

Thanks

David Hailey Wed, 11/24/2010 - 06:57

I wouldn't worry too much about causing a Split-Brain condition.  In fact, a brief period of split-brain is normal during recovery.  In your scenario, I'd go with TAC and reboot the server.  Technically, if you have your ports configured correctly for failover, then you should be able to reboot the subscriber at any time.  Here are some additional tidbits on Split-Brain and what it is:

Effects of a Split-Brain Condition

When the servers in a Cisco Unity Connection cluster have Primary status at the same time (for example, when the servers have lost their connection with each other), both servers handle incoming calls (answer phone calls and take messages), send message notifications, send MWI requests, and accept changes to the administrative interfaces (such as Connection Administration). However, the servers do not replicate the database and message store to each other and do not receive replicated data from each other.

When the connection between the servers is restored, the status of the servers temporarily changes to Split Brain Recovery while the data is replicated between the servers and MWI settings are coordinated. When the recovery process is complete, the publisher server has Primary status and the other server has Secondary status.

Hailey

Please rate helpful posts!

latintrpt Wed, 11/24/2010 - 07:02

Thank you for your fast response.

The reason I bring this up is that I upgraded to this code about a month ago.  I first upgraded the publisher and then the subscriber.  When the subscriber came back up, it caused a split-brain effect for at least 15-20 minutes.  During that time, when I tried to retrieve my voice messages, Unity Connection told me that voice messages were unavailable.

Do you recommend I do this during a downtime then?

David Hailey Wed, 11/24/2010 - 07:06

Again, split-brain during recovery is normal, and recovery after an upgrade takes much longer than a simple reboot.  It takes approximately 15-20 minutes for all of your Tomcat and other web services to start before a server is recognized, so this could be the lag you saw.  Doing it during a downtime is never a bad idea, but if you are experiencing a bottleneck with VM then I'd do it sooner rather than later.  That is totally your call.  If you are looking for what I believe to be the most stable version of 7.1 code, then it is 7.1.3.32900-4 or 7.1(3b)SU2.  Tried and tested.

Why don't you just do the reboot this afternoon?   I'm sure you'll have some downtime as folks depart for the holidays.

Hailey

Please rate helpful posts!

jeff.singh_2 Tue, 06/03/2014 - 04:32

Hi David and all,

We have a Unity Connection 9.x pair. After an IP address change on the subscriber, we are getting the error message:

‘Communication is not functioning correctly between the servers in the Cisco Unity Connection cluster’

Each server shows its own status as publisher but reports lost communication with the other server (when we check cluster management from either server).

 show tech network hosts – displays correct info.

show perf query class "Number of Replicates Created and State of Replication"
==>query class :

 - Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount  -> Number of Replicates Created   = 603
    ReplicateCount  -> Replicate_State                = 2

On both servers the replication state is showing as 2, so it's good.

Various reboots have been done, services are up and working, and there are no firewalls between the servers.
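For anyone collecting this counter from several servers, here is a minimal Python sketch that pulls the Replicate_State value out of pasted CLI output. The function name and the sample text are illustrative only, and treating 2 as the healthy value follows the post above rather than any reference I have checked.

```python
import re

def replicate_states(perf_output):
    """Return every Replicate_State value found in pasted CLI output."""
    return [int(v) for v in re.findall(r"Replicate_State\s*=\s*(\d+)", perf_output)]

# Sample text copied from the output quoted in this post.
sample = """
 - Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount  -> Number of Replicates Created   = 603
    ReplicateCount  -> Replicate_State                = 2
"""

states = replicate_states(sample)
# Assumption taken from this thread: a state of 2 is the "good" value.
replication_ok = bool(states) and all(s == 2 for s in states)
print(states, replication_ok)
```

Note that later in this thread the same cluster reported "Database replication is not active" while this counter read 2, so the counter alone probably should not be treated as proof that the cluster is healthy.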

Any advice appreciated.

 

TIA…Jeff

David Hailey Tue, 06/03/2014 - 08:27

Can you post the output of the "show cuc cluster status" from the CLI of each server?

AskQ2Forum_2 Thu, 06/05/2014 - 01:27

David, thanks for assistance.

From the Pub we get:

ACE_File_Lock::ACE_File_Lock: Permission denied /dev/shm/CCM_GENstatusLock_0

Server Name  Member ID  Server State  Internal State           Reason
-----------  ---------  ------------  -----------------------  -------
dvbcucaw01   0          Primary       Pri Active Disconnected  Normal
dvbcucbw01   1          Disconnected  Unknown                  Unknown

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_pub      100 Active   Local           0
g_ciscounity_sub1     101 Active   Connecting 40188370 Jun  5 09:06:29

From the sub we get:

Server Name  Member ID  Server State  Internal State                Reason
-----------  ---------  ------------  ----------------------------  -------
dvbcucaw01   0          Disconnected  Unknown                       Unknown
dvbcucbw01   1          Primary       Sec Act Primary Disconnected  Normal

Database replication is not active
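The two outputs above show each server listing itself as Primary and its peer as Disconnected, which matches the split-brain description earlier in the thread. As a rough sketch, the comparison could be scripted in Python as below. The function names and the KNOWN_STATES set are my own assumptions: the set only covers single-word states that appear in this thread, so a multi-word state such as Split Brain Recovery would need extra handling.

```python
# Single-word "Server State" values seen in this thread (an assumption, not a full list).
KNOWN_STATES = {"Primary", "Secondary", "Disconnected"}

def server_states(cli_output):
    """Map server name -> Server State from pasted 'show cuc cluster status' text."""
    states = {}
    for line in cli_output.splitlines():
        tokens = line.split()
        # Data rows look like: <name> <member id> <state> <internal state...> <reason>
        if len(tokens) >= 3 and tokens[1].isdigit() and tokens[2] in KNOWN_STATES:
            states[tokens[0]] = tokens[2]
    return states

def both_claim_primary(pub_view, sub_view):
    """True when the two views each list a Primary, but not the same server."""
    pub_primaries = {n for n, s in pub_view.items() if s == "Primary"}
    sub_primaries = {n for n, s in sub_view.items() if s == "Primary"}
    return bool(pub_primaries) and bool(sub_primaries) and pub_primaries != sub_primaries

pub_output = """
dvbcucaw01   0          Primary       Pri Active Disconnected  Normal
dvbcucbw01   1          Disconnected  Unknown                  Unknown
g_ciscounity_pub      100 Active   Local           0
"""
sub_output = """
dvbcucaw01   0          Disconnected  Unknown                       Unknown
dvbcucbw01   1          Primary       Sec Act Primary Disconnected  Normal
"""

pub_view = server_states(pub_output)
sub_view = server_states(sub_output)
print(pub_view, sub_view, both_claim_primary(pub_view, sub_view))
```

The replication table rows (g_ciscounity_pub and so on) are skipped because their third column is not one of the listed states.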

 

AskQ2Forum_2 Mon, 06/16/2014 - 12:50

Cisco TAC with root level access edited one of the hosts files to correct the IP address.

Brian Meade Tue, 06/03/2014 - 12:45

You should probably open a new thread for this issue.

David Hailey Tue, 06/03/2014 - 12:55

There is actually a duplicate post in another topic...so just keep that one thread going.  Same question / response going there.

rob.huffman Wed, 11/24/2010 - 07:18

Hi latintrpt,

I will add my +5 point vote for this good info from Hailey

Just to let you know, we have had to reboot our Sub during office hours on two occasions. This was done without any noticeable effect for our users.

Just make sure the Pub is set as primary and that the Sub has been set to "Stop taking calls" (give some time for calls to clear) before rebooting.

On another related note, we had a time when our Sub was out of commission completely for a few days due to a Tomcat service bug. Again, our users saw no issues during this time.

The CUC Cluster design has been well thought out by Cisco and is indeed very resilient!

Cheers!

Rob

latintrpt Wed, 11/24/2010 - 08:20

Thanks for the information guys.

Rob, I am unable to set the Sub to "Stop Taking Calls" because the option shows "Not Available".  The server status for the subscriber is "Not Reachable".

The Publisher does say "Primary"

David Hailey Wed, 11/24/2010 - 08:32

If the Publisher is primary and you are not experiencing call issues, you simply need to reboot the Subscriber - either now or later.

Another way to force the Sub to not take calls is to remove its ports from the CUCM hunt list/line group configurations.

Hailey

yahsiel2004 Mon, 06/16/2014 - 13:44

+5 Rob and David! You guys always have good info.

Latin,

When you verified the Sub's "Port Manager" status, did you log in through the Pub or the Sub? If you logged in through the Pub, try logging in through the Sub so you can verify whether it shows "Taking Calls" or also "Not Available".

Regards,

Yosh

This Discussion

Posted November 24, 2010 at 6:49 AM
Updated November 24, 2010 at 6:50 AM