DB replication does not work after CUCM upgrade 6.1.1->6.1.3

Answered Question
Jun 7th, 2010

Hi!

There are two CUCM servers, a Publisher and a Subscriber.

Both ran version 6.1.1 and DB replication worked normally. Then I upgraded both servers to 6.1.3 in the following way:

1. 6.1.3 was installed to Publisher without switching version.

2. 6.1.3 was installed to Subscriber without switching version.

3. Publisher was rebooted and switched version to 6.1.3.

4. Subscriber was rebooted and switched to 6.1.3.

After that, the Subscriber can't sync the DB from the Publisher.

Both servers can ping each other and they are available for IP phones and other hosts.

Result of command "utils dbreplication status" from the Publisher:

****************************************************************************************************
This command reads and writes database information from all machines and will take quite some time...please be patient.
****************************************************************************************************

-------------------- utils dbreplication status --------------------

Processing ccmdbtemplate_kem_ccm1_ccm6_1_3_3000_1_1_86_typedberrors with 802 rows group 1
Error returned 99 at 1342
Error returned 99 at 823
command failed -- *Error* <99> is not a known Enterprise Replication error (99)
Output is in file cm/trace/dbl/sdi/ReplicationStatus.2010_06_08_09_10_11.out

Please use "file view activelog cm/trace/dbl/sdi/ReplicationStatus.2010_06_08_09_10_11.out " command to see the output

Result of command "utils dbreplication status" from the Subscriber:

utils dbreplication status output

To determine if replication is suspect, look for the following:
    (1) Number of rows in a table do not match on all nodes.
    (2) Non-zero values occur in any of the other output columns for a table
    (3) ***** PLEASE IGNORE MISMATCHES IN ReplicationDynamic TABLE *****

connect to kem_ccm2_ccm6_1_3_3000_1 failed
Enterprise Replication not active  (62)
command failed -- unable to connect to server specified  (5)

How can I fix this problem?

Should I run "utils dbreplication dropadmindb" on the Subscriber and then "utils dbreplication repair all" on the Publisher?


David Hailey Mon, 06/07/2010 - 19:26

Don't run the dropadmindb command unless you have read up on the troubleshooting techniques and know what is going on there.  The best thing to do after an upgrade like this is typically to reboot the cluster first.  Start with the Publisher and let it come back online and become accessible.  Then reboot the Subscriber and monitor RTMT for DB replication status once both servers are online.  The other way is via the CLI, where the first step is to try to repair the DB replication.  For an in-depth look at the various methods of troubleshooting, go here:

http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/trouble/7_0_1/tbsystem.html#wp1172876

Disregard that it's a 7.x document.  The CLI for DB commands hasn't changed much, if at all, that I know of between versions.

Hailey

Please rate helpful posts!

William Bell Mon, 06/07/2010 - 19:40

I am not familiar with Error 99.  I would check the following before resetting the db.

A) Validate the network first:

1. "show network tech hosts"

Check to see if there are any discrepancies between IP address and/or hostname

2.  "utils diagnose module validate_network"

Check to see if there are any errors

B) Check that the db is loaded on the publisher and subscriber nodes:

- "run sql select name from processnode"

You are checking to ensure that you get the CUCM node names or IP addresses from the processnode table.  You will either get data or receive an error that the database failed to load (or similar).  If you get node names, check those against the output from the network validation.

C) Check your install log files.  It is very important that you find out if there was a failure with importing/migrating any of the data from the 6.1(1) system to the 6.1(3) system.

D) Check replication/replicates (on each node):

show perf query class "Number of Replicates Created and State of Replication"

Your next move will depend on what you find out when taking a look at the above information.  Let's find out what is going on first.
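To make checks A and B above concrete, here is a minimal Python sketch that cross-checks the names returned by the processnode query against host entries. It runs on a workstation against output copied from the CLI. The output layouts it parses (a "name" header with a "====" rule and an EnterpriseWideData placeholder row, plus hosts-file style lines) are assumptions, as are the function names and the 10.0.0.x addresses; adjust them to what your nodes actually print.

```python
def parse_processnode(sql_output):
    """Extract node names from pasted 'run sql select name from processnode' output.

    Assumed layout: a 'name' header, a rule of '=' characters, then one value per line.
    """
    names = []
    for line in sql_output.splitlines():
        value = line.strip()
        # Skip blank lines, the column header, and the '====' rule.
        if not value or value == "name" or set(value) == {"="}:
            continue
        # Assumption: this placeholder row is not a real server.
        if value == "EnterpriseWideData":
            continue
        names.append(value)
    return names


def parse_hosts(hosts_output):
    """Map each IP address to its set of names, from hosts-file style lines."""
    table = {}
    for line in hosts_output.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2:
            table.setdefault(fields[0], set()).update(fields[1:])
    return table


def unknown_nodes(nodes, hosts):
    """Return processnode entries that match neither an IP nor a name in hosts."""
    known = set(hosts)
    for names in hosts.values():
        known |= names
    return [node for node in nodes if node not in known]


# Hypothetical sample data for illustration only.
SAMPLE_SQL = """name
==================
EnterpriseWideData
10.0.0.1
10.0.0.2
"""
SAMPLE_HOSTS = """127.0.0.1 localhost
10.0.0.1 kem-ccm1
"""

print(unknown_nodes(parse_processnode(SAMPLE_SQL), parse_hosts(SAMPLE_HOSTS)))
```

Any entry this reports is a node that the hosts data does not cover, which is the kind of discrepancy check A is looking for.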

Note, if it comes to resetting the db, I haven't used the approach you describe.  I have used the following.

When I have had to reset the database I usually use "utils dbreplication reset all" from the publisher node.  It is recommended that prior to running this command on the publisher that you:

a) use the "utils diagnose module validate_network" command to validate the network

b) use the "utils dbreplication stop" on the subscriber nodes

After successfully completing the above steps, then use "utils dbreplication reset all".

HTH.


Regards,
Bill

viacheslav.k Mon, 06/07/2010 - 20:09

A) 1. There is no such command as "show network tech hosts".

    2. OK.

B) The IP addresses of the Publisher and the Subscriber are listed on both servers. OK.

C) How do I check this? Which command? Is it possible to copy the logs to my work PC?

D) Publisher:

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount  -> Number of Replicates Created   = 393
    ReplicateCount  -> Replicate_State                = 3

Subscriber:

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount  -> Number of Replicates Created   = 0
    ReplicateCount  -> Replicate_State                = 0

Can I run "utils dbreplication reset all" on the Publisher during working hours? Does it affect the CCM processes?
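For reading counter output like the above, a small Python sketch can pull the values out of pasted text and attach a meaning to Replicate_State. The state descriptions are the ones commonly given in Cisco documentation as I recall them, so treat them as an assumption and verify them for 6.1(3). The function names are hypothetical.

```python
import re

# Assumed meanings of Replicate_State; verify against the documentation
# for your exact CUCM release before relying on them.
REPLICATE_STATES = {
    0: "replication not started (initializing)",
    1: "replicates created, but the count is incorrect",
    2: "replication is good",
    3: "replication is bad in the cluster",
    4: "replication setup did not succeed",
}


def parse_replicate_counters(perf_output):
    """Return {counter name: integer value} from pasted 'show perf query class' output."""
    counters = {}
    for line in perf_output.splitlines():
        # Matches lines such as 'ReplicateCount  -> Replicate_State   = 3'.
        match = re.search(r"->\s*(.+?)\s*=\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def describe_state(state):
    """Turn a Replicate_State number into a short human-readable label."""
    return f"{state}: {REPLICATE_STATES.get(state, 'unknown state')}"


# Publisher output as posted in this thread.
PUBLISHER_OUTPUT = """
- Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount  -> Number of Replicates Created   = 393
    ReplicateCount  -> Replicate_State                = 3
"""

counters = parse_replicate_counters(PUBLISHER_OUTPUT)
print(counters["Number of Replicates Created"], describe_state(counters["Replicate_State"]))
```

Under that reading, the Publisher reports a broken replication state, and the Subscriber (0 replicates, state 0) has not started replication at all.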

David Hailey Mon, 06/07/2010 - 20:12

The command is show tech network hosts.

You get the install logs using RTMT.  That's the fastest way.

Hailey

Please rate helpful posts!

Correct Answer
William Bell Mon, 06/07/2010 - 20:56

OK.  So, if you want, you can check if a reset of the cluster will establish synchronization but it is highly unlikely. Looks like you can load the db on the pub and your replication is just broken.  The last question was if you found any database errors in the install log files.

It looks like your database needs a reset using the command sequences we have discussed.

SUBSCRIBERS

admin: utils dbreplication stop

PUBLISHER

admin:  utils dbreplication stop

PUBLISHER

admin: utils dbreplication reset  all

It may take about 5 minutes to execute the stop and about 30 minutes to do the reset (depending on a) size of db, b) class of servers, c) network latency between cluster nodes).

[Edit] Oh, monitor the progress using RTMT or using the OS command show perf query class "Number of Replicates Created and State of  Replication" periodically.
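The ordering above matters more than the commands themselves, so here is a minimal Python sketch that only builds the runbook in that order for a given cluster. It does not connect to any server. The function name is hypothetical, and kem-ccm2 is assumed to be the Subscriber's hostname based on the replication server name in the original post.

```python
def reset_plan(publisher, subscribers):
    """Return (node, command) steps in the order described in this post.

    1. Stop replication on every subscriber.
    2. Stop replication on the publisher.
    3. Reset replication for the whole cluster from the publisher.
    """
    steps = [(node, "utils dbreplication stop") for node in subscribers]
    steps.append((publisher, "utils dbreplication stop"))
    steps.append((publisher, "utils dbreplication reset all"))
    return steps


# Hostnames assumed from this thread; replace with your own.
for number, (node, command) in enumerate(reset_plan("kem-ccm1", ["kem-ccm2"]), start=1):
    print(f"{number}. on {node}: admin: {command}")
```

With more than one subscriber, the stop command is listed for each of them before anything is run on the publisher.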

HTH.

Regards,

Bill

Please remember to rate helpful posts.

viacheslav.k Mon, 06/07/2010 - 21:06

I found the following in the log files:

.......

06/03/2010 16:08:22 CCMInstall|(CAPTURE) command failed -- Enterprise Replication not active  (62)|
06/03/2010 16:08:22 CCMInstall|(CAPTURE) command failed -- The syscdr database is missing!|

.......
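Once an install log has been downloaded, lines like these can be found without reading the whole file. Below is a minimal Python sketch that assumes every entry follows the "date time component|message|" layout seen in the two lines above; the regular expression and function name are illustrative, not part of any Cisco tool.

```python
import re

# Assumed entry layout: 'MM/DD/YYYY HH:MM:SS Component|message|'
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\w+)\|(.*)\|\s*$")


def failed_commands(log_text):
    """Return (timestamp, reason) for every entry that reports 'command failed'."""
    hits = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match and "command failed" in match.group(3):
            # Keep only the text after the '--' separator as the reason.
            reason = match.group(3).split("--", 1)[-1].strip()
            hits.append((match.group(1), reason))
    return hits


# The two entries quoted in this post.
SAMPLE_LOG = """06/03/2010 16:08:22 CCMInstall|(CAPTURE) command failed -- Enterprise Replication not active  (62)|
06/03/2010 16:08:22 CCMInstall|(CAPTURE) command failed -- The syscdr database is missing!|
"""

for timestamp, reason in failed_commands(SAMPLE_LOG):
    print(timestamp, reason)
```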

William Bell Mon, 06/07/2010 - 20:18

Re:  "show network tech hosts"

Whoops.  As Hailey pointed out, I typed this in backwards.  The command is "show tech network hosts".  If the "utils diagnose..." command returned error free, then I suspect "show tech network hosts" should as well.

You can collect logs via RTMT or you can use "file list install *"  and then use "file get install" to retrieve a particular file or set of files.  You can use masks to retrieve the files as well.

You do not want to run the dbreplication reset during core business hours; it could definitely affect call processing.

HTH.


Regards,
Bill

viacheslav.k Mon, 06/07/2010 - 20:56

I discovered one thing that I did not notice at first.

The "utils diagnose module validate_network" log from Publ:

admin:utils diagnose module validate_network

Log file: /var/log/active/platform/log/diag2.log

Starting diagnostic test(s)
===========================
test - validate_network    : Reverse DNS lookup failed

However, I use IP addresses instead of server names. Moreover, every server has a hosts file with the servers' IP addresses and names.

Each server can ping the other by name or IP address.

System log file:

..............

..............

06-08-2010 10:31:42 validate_network:     checking network [/usr/local/bin/base_scripts/validateNetworking.sh]
06-08-2010 10:35:45 validate_network:     retrieving pub name from [/usr/local/platform/conf/platformConfig.xml]
06-08-2010 10:35:45 validate_network:     Hostname: [kem-ccm1]
06-08-2010 10:35:45 validate_network:     found pub name [kem-ccm1]
06-08-2010 10:35:45 validate_network:     result: 4, message: Reverse DNS lookup failed

On the other hand, "utils network ping kem-ccm1" pings the address normally. Moreover, kem-ccm1 is the Publisher itself.

As I understand it, this could be a reason for the DB replication failure.

I'm going to try resetting DB replication in the evening.

William Bell Mon, 06/07/2010 - 21:05

Well, I understand that you define your Servers in the CCMAdmin interface by IP address.  This is good practice as it removes DNS from the device registration process.  Not that anything is wrong with DNS, but it is generally considered good practice.  However, that being said, I always recommend that you still ensure that host names and IP addresses are added to forward and reverse lookup zone files.

The validate_network module will fail if you don't have DNS configured but that doesn't necessarily mean replication will fail.  I always use DNS so I can't give you much in the way of "what if" analysis.
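If you do clean up DNS, a quick way to confirm that both directions work is to check them from a workstation that uses the same DNS servers as the cluster. The Python sketch below does a forward lookup and then a reverse lookup on the result. It is not what validate_network runs internally, and it uses the local machine's resolver (including its own hosts file), so treat it as an approximation. The function name and return shape are hypothetical.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that hostname resolves to an IP and that the IP resolves back to it.

    The resolver functions are parameters so they can be swapped out for testing.
    """
    try:
        ip_address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {"hostname": hostname, "ok": False, "reason": f"forward lookup failed: {exc}"}

    try:
        ptr_name, aliases, _ = reverse(ip_address)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return {"hostname": hostname, "ok": False, "reason": f"reverse lookup failed: {exc}"}

    # Compare short names so 'kem-ccm1' matches 'kem-ccm1.example.com'.
    returned = {name.lower().split(".")[0] for name in [ptr_name, *aliases]}
    if hostname.lower().split(".")[0] not in returned:
        return {"hostname": hostname, "ok": False,
                "reason": f"{ip_address} resolves back to {ptr_name}, not {hostname}"}
    return {"hostname": hostname, "ok": True, "reason": f"{hostname} <-> {ip_address}"}
```

Usage would be something like `print(check_dns("kem-ccm1"))` for each node name; a "reverse lookup failed" reason corresponds to a missing PTR record in the reverse zone.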

Looks like you need to reset the db replication (see my earlier post).  You may want to clean up the DNS if you can.  If not, you can try the replication status.

HTH.

Regards,
Bill
