I have a UCM cluster of 3 servers, version 126.96.36.1990-13, deployed in the Clustering over IP WAN model. Two servers are located at one site (publisher and subscriber 1); the third is at the remote site (subscriber 2).
This configuration worked fine for more than a month, but since yesterday I have been experiencing the following problems:
- RTMT reports the SDLLinkOutOfService alarm
From the Cisco documentation this alarm is defined as: This alert occurs when the SDLLinkOutOfService event gets generated. This event indicates that the local Cisco Unified Communications Manager cannot communicate with the remote Cisco Unified Communications Manager. This event usually indicates network errors or a non-running remote Cisco Unified Communications Manager.
- I suppose that, because of the errors above, database replication is broken on subscriber 2:
admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :
- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 393
ReplicateCount -> Replicate_State = 3
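For anyone reading the counter above, here is a minimal Python sketch that pulls Replicate_State out of the command output and maps it to a description. The state meanings in the table are the ones commonly documented for this counter and are an assumption here, so verify them against the documentation for your CUCM version. The function names are hypothetical.

```python
import re

# Commonly documented meanings of Replicate_State (assumption: verify
# against the documentation for your CUCM version).
REPLICATE_STATES = {
    0: "replication not started / initializing",
    1: "replicates created, but count is incorrect",
    2: "replication is good",
    3: "replication is bad in the cluster",
    4: "replication setup did not succeed",
}


def parse_replicate_state(output):
    """Return the Replicate_State value found in the CLI output, or None."""
    match = re.search(r"Replicate_State\s*=\s*(\d+)", output)
    if match is None:
        return None
    return int(match.group(1))


def describe_replicate_state(state):
    """Map a state code to a description; unknown codes are reported as such."""
    return REPLICATE_STATES.get(state, "unknown state")


# Sample text copied from the output in this post.
sample = (
    "ReplicateCount -> Number of Replicates Created = 393 "
    "ReplicateCount -> Replicate_State = 3"
)
state = parse_replicate_state(sample)
print(state, "->", describe_replicate_state(state))
```

Under that assumed mapping, the value 3 reported on subscriber 2 corresponds to bad replication in the cluster.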
I checked the WAN link and everything seems OK: there is no loss on the link, the round-trip time is fine (~30 ms), and port 8002, on which the Cisco CallManager service communicates between the nodes, is open and accessible from both sites.
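A port check like the one described can be scripted from a workstation at each site. The following is only a rough sketch using Python's standard socket module; the function name and the host name in the usage comment are hypothetical.

```python
import socket
import time


def check_tcp_port(host, port, timeout=3.0):
    """Attempt one TCP connection and return (reachable, elapsed_ms)."""
    start = time.monotonic()
    try:
        # create_connection completes the TCP handshake or raises OSError
        # (refused, unreachable, or timed out).
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        reachable = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return reachable, elapsed_ms


# Example usage (hypothetical host name, port 8002 as mentioned above):
#   ok, ms = check_tcp_port("subscriber2.example.local", 8002)
#   print("open" if ok else "closed/unreachable", f"{ms:.1f} ms")
```

A successful handshake only shows that the port is reachable at that moment; it says nothing about loss or latency under load or with large packets.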
Yesterday I rebooted the subscriber 2 server and everything went fine. Today the same problem occurred again. I tried rebooting it again, but now not all services are starting (from RTMT: Cisco CallManager Admin, Cisco CallManager Personal Directory, Cisco CallManager Serviceability and a few more). Replication on subscriber 2 is broken, too (status code: 3).
So at this point I can't figure out what the problem is. The connection between the sites seems OK. No new configuration has been made on the devices. I took a look at the SDL logs from the publisher and I see a lot of errors:
FSM_NO_TRANSITION_DEFINED Description: A transition is not defined for the input signal. Check state machine definition in initStateMachine().
but honestly I don't know what they mean.
Do you have any ideas what the problem could be? I attach an SDL log from the publisher server. 10:29:35 is the exact time when the DB replication error alarm popped up in RTMT and the error above was logged in the trace.
Thanks for the reply. In fact, yesterday we isolated the problem to a network issue.
There are two tunnels between the central office and the remote site. The primary tunnel was down, so the traffic was going through the secondary tunnel. The problem is that, due to a wrong configuration on the routers in the central office, the traffic was traversing some devices in other offices of the client, where the link capacity is low (2 Mb). I suspect that during peak hours the latency on the link went too high, which started to affect the cluster.
In fact the utils network connectivity command reported the following:
admin:utils network connectivity
This command can take up to 3 minutes to complete. Continue (y/n)?y
Running test, please wait ...
Test failed: Could not receive TCP packets from publisher.
admin:
Also, ping with large packets (over 1500 bytes) reported loss on the link.
Anyway, after bringing the primary tunnel up, communication in the cluster was restored and the DB replication is fine now.