CUC 10.0.1 cluster status stuck in Split Brain Recovery (SBR) on Primary server - HA reports fine.

Erick Bergquist
Level 6

Hi,

We have a 10.0.1.11900 CUC cluster and everything is working fine (no one is having issues with voice mail, etc.), but the reported cluster status is not consistent between the two servers.

DB replication status shows 2 on both servers.

 

Primary unity server cluster status shows Primary/split brain recovery.

HA Unity server cluster status shows Primary/Secondary.

 

utils diagnose test - everything tests fine except the tomcat_connectors test.

 

test - tomcat_connectors   : Failed - The HTTPS port is not responding to local requests.  Please collect all of the Tomcat logs for root cause analysis: file get activelog tomcat/logs/*

 

We've shut down the HA server and rebooted the primary, then waited a while after the primary was back up and active before bringing the HA server back up, and the status is still the same.

We reset DB replication, with the same result.

On the HA server I made the HA server primary, and the cluster status flipped to Secondary/Primary. I then made the primary server primary again, but the cluster status on the primary server always shows Split Brain Recovery for the secondary/HA server.

 

No core dumps on either server and all services are started. 

 

Has anyone seen this before or have any thoughts? I have a TAC case open on this, but so far we're in the same boat.

 

Would the utils cuc cluster renegotiate command help? We did not replace a server, so I don't really want to overwrite data on the publisher server. The issue seems to be with the publisher, since the HA server shows fine, but I'm not sure. I don't want to lose messages, etc., so I don't really want to run these commands.

Thanks.

3 Replies

Anirudh Mavilakandy
Cisco Employee

Enable all levels of CuSRM Micro traces. Wait for some time and collect Connection Server Role Manager Logs. 

Don't run the renegotiate command. Logs should tell you where the issue is.

For the failed tomcat_connectors, the server is affected by CSCuj57818. Ask your TAC engineer for the workaround. 

What is the TAC SR number? If there is no progress, ask the engineer to escalate it. You can also just restart the Connection Server Role Manager service on the Publisher and see if that helps.

Ok, thanks.

 

The SRM logs indicate that the Connection Digital Networking Replication Agent service is not running. However, when I start it, it stops right away, and the CuReplicator log states that digital networking is not enabled.

From SRM Log:

23:47:20.100 |17755,,,SRM,7,<svcmon> checkServiceStatus: started service monitoring
23:47:20.100 |17755,,,SRM,7,<svcmon> Service Status: 1 service(s) not running. Service name(s):
23:47:20.100 |17755,,,SRM,7,<svcmon> Connection Digital Networking Replication Agent
23:47:24.674 |28471,,,SRM,11,<Timer-3> [snd] Type: Heartbeat
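
For anyone scanning a longer SRM log for the same pattern, here is a minimal Python sketch that pulls out the services the service monitor reports as not running. It is only a sketch under assumptions: the line layout is inferred solely from the four lines quoted above (a count line followed by one service name per `<svcmon>` line), and the function name `stopped_services` is made up for illustration, not part of any Cisco tool.

```python
import re

# Sample lines copied from the SRM log excerpt quoted in this post.
SAMPLE = """\
23:47:20.100 |17755,,,SRM,7,<svcmon> checkServiceStatus: started service monitoring
23:47:20.100 |17755,,,SRM,7,<svcmon> Service Status: 1 service(s) not running. Service name(s):
23:47:20.100 |17755,,,SRM,7,<svcmon> Connection Digital Networking Replication Agent
23:47:24.674 |28471,,,SRM,11,<Timer-3> [snd] Type: Heartbeat
"""

# Assumed layout: "<time> |<fields>,<thread> <message>" (inferred from the excerpt only).
LINE_RE = re.compile(r"^(?P<time>\S+) \|[^<]*<(?P<thread>[^>]+)> (?P<msg>.*)$")
COUNT_RE = re.compile(r"Service Status: (\d+) service\(s\) not running")


def stopped_services(log_text):
    """Return (timestamp, service name) pairs reported as not running.

    Hypothetical helper: assumes each service name appears on its own
    <svcmon> line directly after the "Service Status" count line.
    """
    stopped = []
    remaining = 0  # service-name lines still expected after a count line
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            remaining = 0
            continue
        msg = match.group("msg")
        count = COUNT_RE.search(msg)
        if count:
            remaining = int(count.group(1))
        elif remaining > 0 and match.group("thread") == "svcmon":
            stopped.append((match.group("time"), msg.strip()))
            remaining -= 1
    return stopped


if __name__ == "__main__":
    for when, name in stopped_services(SAMPLE):
        print(f"{when}  not running: {name}")
```

Run against a log file collected from the server, this would list each service name SRM flagged, which makes it easier to spot a service such as the Digital Networking Replication Agent being reported repeatedly. If the real log format differs from the excerpt, the two regular expressions would need adjusting.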

 

From Replicator log:

admin:file tail activelog cuc/diag_CuReplicator_00000049.uc
23:42:59.208 HDR|09/14/2014 ,Significant
23:42:59.208 |28914,,,CuReplicator,0,Digital Networking is not enabled. Replicator will stop now.

 

 

There is no digital networking setup to other unity systems, and only one location. 

 

Also, the Connection Server Role Manager service can't be restarted from the CLI or the GUI, so restarting it requires either root access or a server reboot.

I compared it to another CUC cluster and deactivated the Digital Networking service, and the SRM logs seem happier now. I will wait a bit and see if it clears up the SBR status.

This is now resolved. TAC ended up using root-level access (remote account) to redefine the database, which solved the problem.

 

Thanks for your help/pointers on the other items which did help clean up some noise.