I have a UCM cluster of 3 servers, version 126.96.36.1990-13, deployed in the Clustering over IP WAN model. Two servers are located at one site (publisher and subscriber 1); the third is at the remote site (subscriber 2).
This configuration worked fine for more than a month, but since yesterday I have been experiencing the following problems:
- RTMT reports the SDLLinkOutOfService alarm
The Cisco documentation defines this alarm as: This alert occurs when the SDLLinkOutOfService event gets generated. This event indicates that the local Cisco Unified Communications Manager cannot communicate with the remote Cisco Unified Communications Manager. This event usually indicates network errors or a non-running remote Cisco Unified Communications Manager.
- I suppose it is because of the errors above that database replication is broken on subscriber 2:
admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :
- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 393
ReplicateCount -> Replicate_State = 3
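For reference, here is a minimal Python sketch that pulls the Replicate_State value out of output like the above and attaches a human-readable meaning. The state descriptions in the table are an assumption based on how the counter is commonly documented for CUCM, so check them against the documentation for your release; the function names are hypothetical helpers, not part of any Cisco tool.

```python
import re

# Assumed meanings of the Replicate_State counter (verify for your release).
REPLICATE_STATES = {
    0: "replication not started",
    1: "replicates created, but the count is incorrect",
    2: "replication is good",
    3: "replication is bad (table mismatch in the cluster)",
    4: "replication setup did not succeed",
}


def parse_replicate_state(cli_output):
    """Return the Replicate_State value as an int, or None if it is absent."""
    match = re.search(r"Replicate_State\s*=\s*(\d+)", cli_output)
    if match is None:
        return None
    return int(match.group(1))


def describe_state(state):
    """Map a state code to its assumed description."""
    return REPLICATE_STATES.get(state, "unknown state")


# Sample text copied from the output shown in this post.
SAMPLE = """\
ReplicateCount -> Number of Replicates Created = 393
ReplicateCount -> Replicate_State = 3
"""

if __name__ == "__main__":
    state = parse_replicate_state(SAMPLE)
    print(state, "->", describe_state(state))
```

Under that assumed mapping, anything other than 2 on a node means replication on that node needs attention.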
I checked the WAN link and everything seems OK: there is no loss on the link, the round-trip time is fine (~30 ms), and port 8002, on which the Cisco CallManager service communicates between the nodes, is open and accessible from both sites.
Yesterday I rebooted the subscriber 2 server and everything went fine. Today the same problem occurred again. I tried rebooting it again, but now not all services are starting (from RTMT: Cisco CallManager Admin, Cisco CallManager Personal Directory, Cisco CallManager Serviceability and a few more). Replication on subscriber 2 is broken, too (status code: 3).
So at this point I can't figure out what the problem is. The connection between the sites seems OK. No new configuration has been made on the devices. I took a look at the SDL logs from the publisher and I see a lot of errors:
FSM_NO_TRANSITION_DEFINED Description: A transition is not defined for the input signal. Check state machine definition in initStateMachine().
but honestly I don't know what they mean.
Do you have any ideas what the problem could be? I attach an SDL log from the publisher server. 10:29:35 is the exact time when the DB replication error alarm popped up in RTMT and the error above was logged in the trace.
From the SDL trace we can see that at 10:29:35 connectivity between node 1 and node 3 was restored, with the corresponding SdlLinkISV signals and further signal exchange.
Once connectivity was restored, the replication check ran and found mismatched information in the tables, resulting in replication status 3.
The facts above point to a network issue.
To prove or rule it out, you can capture sniffer traces from both nodes, either with the internal capture tool (utils network capture) or with an external PC (Wireshark), during the time you experience this issue.
Being able to ping or telnet to port 8002 does not necessarily mean there are no problems in the network.
In production it often happens that several routes exist between two hosts for redundancy purposes, and not all of them can transport big packets.
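One way to probe for the big-packet problem is to send pings with the Don't Fragment bit set at several sizes and see where they stop getting through. The sketch below only builds the commands; it assumes a Linux test host with iputils ping (where `-M do` sets DF) placed on the same network segment as the node, not the CUCM admin CLI. The host name and the list of MTU values are placeholders.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU too small for an ICMP echo request")
    return payload


def df_ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command with DF set (-M do)."""
    return [
        "ping", "-M", "do",
        "-c", str(count),
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


if __name__ == "__main__":
    # Placeholder host name; replace with the remote node's address.
    remote = "subscriber2.example.com"
    for mtu in (1500, 1400, 1300):
        print(" ".join(df_ping_command(remote, mtu)))
```

If the small sizes pass while the 1500-byte test fails or only succeeds intermittently, that suggests at least one of the paths between the sites has a lower MTU, which would fit a link that looks healthy for ping and telnet but drops the larger packets exchanged between the nodes.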