Steps to Troubleshoot CUCM Database Replication Problems in 5.x and 6.x
(written by Bill Benninghoff, heavily borrowing from material written by Laurie Dotter and Nancy Balsbaugh)
If installation has proceeded correctly, then the Informix cdr service should be running on the publisher and on each sub in the cluster. “Cdr” in this context means “Continuous Data Replication”, not call detail records.
To set up replication, scripts run during the install process do these things:
a. define replication on the pub
b. define the template on the pub and realize it (tells pub what to replicate)
c. define replication for each sub
d. realize the template on each sub
e. sync the data between the pub and subs using “cdr sync” or “cdr check”
It is possible that this process broke down at one of the steps.
If you look at the RTMT replication counters and see that the replication state counter is a 3 or a 4 for a given server, that means replication has failed for that server.
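If you are collecting the counter values for several servers, the check above can be written down as a small helper. This is only a sketch: it encodes just the rule stated here (3 or 4 means failure), and the server names in the usage note are made up.

```python
from __future__ import annotations

# Per the text above: a replication state counter of 3 or 4 means failure.
FAILED_STATES = {3, 4}


def replication_failed(state: int) -> bool:
    """Return True when the RTMT replication state counter indicates failure."""
    return state in FAILED_STATES


def failed_servers(states: dict[str, int]) -> list[str]:
    """Return the (sorted) names of servers whose counter indicates failure."""
    return sorted(name for name, state in states.items() if replication_failed(state))
```

For example, failed_servers({"pub": 2, "sub1": 3}) would flag only the hypothetical server "sub1".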
Here are some suggested steps to troubleshoot replication.
Check the replication status by running the following command on the pub, logged in as admin:
utils dbreplication status
This will generate an output file. Study the file to see if replication is set up to each server and if the data is in sync among the servers.
This section above means that there are no rows missing between the databases on the two servers. They are in perfect sync.
Use the “utils dbreplication repair” command if replication is set up but some tables are out of sync. If only one sub is out of sync, you can run the repair for that one node; otherwise use “utils dbreplication repair all” to fix it for all nodes.
Here is an example of a problem with replication from the output file:
If you see that a server’s status is “Dropped” or “Quiescent” or just missing from the table, then you will need to troubleshoot the network connection between the pub and subs.
Another useful diagnostic command is “cdr list serv”. You have to be root to run this command. You can run it on the pub and on each sub to show which servers have been defined from the perspective of the server you are on, and what state those defined servers are in. Here is an example of the output of that command:
A status of “Local” or “Active” is good. A bad status would be “Dropped” or “Quiescent”. If the server is missing from this list, then it is not yet defined.
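The rule above can be sketched as a small classifier. It assumes you have already pulled the server names and statuses out of the “cdr list serv” output by hand or by some other means; it does not parse the command output itself. The function name and the result labels are made up for illustration.

```python
GOOD_STATUSES = {"local", "active"}       # good, per the text above
BAD_STATUSES = {"dropped", "quiescent"}   # bad, per the text above


def classify_servers(expected, observed):
    """Classify each expected server.

    expected: iterable of server names that should be in the cluster.
    observed: dict mapping server name -> status string as read from
              the "cdr list serv" output on one node.
    Returns a dict mapping server name -> "ok", "bad", "not defined",
    or "unknown" (for a status this sketch does not know about).
    """
    result = {}
    for name in expected:
        status = observed.get(name)
        if status is None:
            result[name] = "not defined"   # missing from the list
        elif status.lower() in GOOD_STATUSES:
            result[name] = "ok"
        elif status.lower() in BAD_STATUSES:
            result[name] = "bad"
        else:
            result[name] = "unknown"
    return result
```

Run it once per node, since each node reports the cluster from its own perspective.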
Now, assuming that one of your subs is missing or dropped from the list, the first thing to look at is possible network errors. Do the following to test the network connection:
Ping the pub from the sub with a large amount of data:
ping <pub name> -s 1500
Ping the sub from the pub with a large amount of data:
ping <sub name> -s 1500
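The reason a 1500-byte ping is a useful test is that it cannot fit in one packet on a standard Ethernet path, so it exercises fragmentation. The arithmetic below is a sketch that assumes IPv4 with a 20-byte header (no options), the 8-byte ICMP echo header, and a 1500-byte MTU; your path MTU may be different.

```python
IP_HEADER = 20    # IPv4 header without options (assumption)
ICMP_HEADER = 8   # ICMP echo request header


def packet_size(payload: int) -> int:
    """Total IPv4 packet size for a ping with the given -s payload size."""
    return payload + ICMP_HEADER + IP_HEADER


def fragments_needed(payload: int, mtu: int = 1500) -> int:
    """Number of IP fragments needed to carry the ping across the given MTU."""
    ip_payload = payload + ICMP_HEADER
    # Every fragment except the last carries a multiple of 8 bytes of data.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return -(-ip_payload // per_fragment)  # integer ceiling division
```

With these assumptions, packet_size(1500) is 1528 bytes and fragments_needed(1500) is 2, while a 1472-byte payload is the largest that still fits in a single unfragmented packet. If the small ping works and the large one fails, suspect fragment handling in the network.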
Verify that cluster manager (clm) (or ipsec_mgr in 5.1.2 and earlier) is responding to the host by analyzing /var/log/active/platform/log/clustermgr* logs. (It is platform_mgr in earlier loads).
Clm is responsible for adding hosts to the iptables rules. Clm on the sub and clm on the pub exchange handshakes. Clm on the pub puts the sub in the policy injected state and adds the host to the iptables rules, allowing replication to work. So, if iptables is blocking replication, the clms are not talking. Clm communicates over 8500/udp, often with large packets, which means they are fragmented. If PMTU discovery is broken (i.e., ICMP packets are dropped or not sent) or fragments are not allowed through the network, then clm does not communicate, iptables is not opened, and as a result replication does not work.
In the clm logs on the pub look for entries about communications with the sub, most importantly one saying that the sub was put into policy injected state.
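If the logs are large, a quick filter can narrow them down. The sketch below assumes that the relevant entries contain both the sub's name and the phrase “policy injected” (matched case-insensitively); the exact wording in your clustermgr logs may differ, so adjust the phrase to what you actually see.

```python
def policy_injected_lines(log_lines, sub_name, phrase="policy injected"):
    """Return log lines that mention sub_name together with the phrase.

    log_lines: any iterable of strings, e.g. an open clustermgr log file.
    sub_name:  host name (or IP address) of the sub as it appears in the log.
    phrase:    text to look for; the default wording is an assumption.
    """
    needle = sub_name.lower()
    wanted = phrase.lower()
    return [
        line.rstrip("\n")
        for line in log_lines
        if needle in line.lower() and wanted in line.lower()
    ]
```

An empty result for a sub suggests that the pub never put that sub into the policy injected state, which points back at the clm handshake and the network checks above.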
Make sure that the dbl rpc service is running on the subs and on the pub. Do this by typing this command as root on the subs where replication is not working:
dbl rpchello <pub name>
If that command returns an error, then check whether the dbl rpc service is running by doing this:
ps -ef | grep dblrpc
If you don’t see anything, that is a problem. Dblrpc must be running on the sub in order for replication to be set up. Once replication is established, dblrpc no longer needs to run.
To start up the dbl rpc service on the sub do this as root:
controlcenter.sh "A Cisco DB Replicator" start
Another thing you may need to do in order to enable the sub and the pub informix processes to talk with each other is to turn off the Linux firewall which is done with the iptables command as root:
As user informix you can start the following command shell by typing this:
This runs a program in which you can select “connect” and try to connect to the Informix database on each server in the cluster. If you are able to connect from the sub to the pub, then the network connection is good and your problem is something else.
Check the log files on the pub and subs:
Look at these four files and make sure the entries in these files on the pub match the entries in these files on the subs:
2. /etc/services (very bottom of the file)
In particular, make sure that in the sqlhosts file there is only one entry for each node in the cluster.
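The duplicate check can be sketched as follows. It assumes the usual Informix sqlhosts layout, where each non-comment line starts with the database server name followed by whitespace-separated fields, and it treats that first field as the key; if your duplicates show up under the host name instead, key on that column.

```python
from __future__ import annotations

from collections import Counter


def duplicate_entries(sqlhosts_text: str) -> list[str]:
    """Return server names that appear more than once in sqlhosts content."""
    names = []
    for line in sqlhosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        names.append(line.split()[0])  # first field: server name (assumption)
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

Pass in the contents of the sqlhosts file from each node; an empty list means no name is listed twice on that node.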
When you are sure that the network connectivity is working, run the following commands to establish the replication:
a. on the sub, run this as admin: utils dbreplication stop
b. on the pub, run this as admin: utils dbreplication stop
c. on the pub, run this as admin: utils dbreplication reset <name of sub that is not working>
Check if replication is now working:
Go to this directory as root on the pub:
Run this command:
This will list all the files in the directory, with the most recent files at the bottom. Scan that list to see if there is a file with the word “define” in the filename and also the name of the server or servers that are having trouble.
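The same scan can be sketched in a few lines if you prefer not to read the listing by eye. The directory is passed in as a parameter because it depends on your system, and the function name is made up for illustration.

```python
from pathlib import Path


def find_define_files(log_dir, server_names):
    """List files, oldest first, whose names contain "define" and a server name.

    log_dir:      directory to scan (supply the directory used in the step above).
    server_names: names of the servers that are having trouble.
    """
    files = sorted(
        (p for p in Path(log_dir).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # most recent files end up last
    )
    wanted = [name.lower() for name in server_names]
    return [
        p.name
        for p in files
        if "define" in p.name.lower()
        and any(name in p.name.lower() for name in wanted)
    ]
```

An empty result for a troubled sub suggests that the define step has not run for that server yet.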