Solved: CM 6.1 Clustering over IP WAN problems

Vladimir Stankov · ‎01-21-2010

Hi,

I have a UCM cluster of 3 servers, version 6.1.2.1000-13, deployed in the Clustering over IP WAN model. Two servers are located at one site (publisher and subscriber 1); the third is at the remote site (subscriber 2).

This configuration worked fine for more than a month, but since yesterday I have been experiencing the following problems:

- RTMT reports the SDLLinkOutOfService alarm

     The Cisco documentation defines this alarm as: This alert occurs when the SDLLinkOutOfService event gets generated. This event indicates that the local Cisco Unified Communications Manager cannot communicate with the remote Cisco Unified Communications Manager. This event usually indicates network errors or a non-running remote Cisco Unified Communications Manager.

- I suppose that because of the errors above, database replication is broken on subscriber 2:

admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 393
ReplicateCount -> Replicate_State = 3
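
For readers who want to check this counter from a script, here is a minimal sketch that pulls Replicate_State out of the CLI output above and attaches a meaning to it. The state descriptions are the ones commonly quoted for this counter and should be treated as an assumption to verify against the documentation for your version; the function name is hypothetical.

```python
import re

# Meanings commonly quoted for the Replicate_State counter.
# Assumption: verify the wording against the docs for your CUCM version.
REPLICATE_STATES = {
    0: "replication not started",
    1: "replicates created, but count is incorrect",
    2: "replication is good",
    3: "replication is bad in the cluster",
    4: "replication setup did not succeed",
}


def parse_replicate_state(cli_output):
    """Return (state, meaning) parsed from 'show perf query class' output, or None."""
    match = re.search(r"Replicate_State\s*=\s*(\d+)", cli_output)
    if match is None:
        return None
    state = int(match.group(1))
    return state, REPLICATE_STATES.get(state, "unknown state")


# Sample text copied from the output in this post.
sample = """\
ReplicateCount -> Number of Replicates Created = 393
ReplicateCount -> Replicate_State = 3
"""

print(parse_replicate_state(sample))
```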

I checked the WAN link and everything seems OK: there is no loss on the link, the round-trip time is fine (~30 ms), and port 8002, on which the Cisco CallManager service communicates between the nodes, is open and accessible from both sites.

Yesterday I rebooted the subscriber 2 server and everything went fine. Today the same problem occurred again. I tried rebooting it again, but now not all services are starting (from RTMT: Cisco CallManager Admin, Cisco CallManager Personal Directory, Cisco CallManager Serviceability and a few more). Replication on subscriber 2 is broken, too (status code: 3).

So at this point I can't figure out what the problem is. The connection between the sites seems OK. No new configuration has been made on the devices. I took a look at the SDL logs from the publisher and I see a lot of errors:

FSM_NO_TRANSITION_DEFINED Description: A transition is not defined for the input signal. Check state machine definition in initStateMachine().

but honestly I don't know what they mean.

Do you have any ideas what the problem could be? I attach an SDL log from the publisher server. 10:29:35 is the exact time when the DB replication error alarm popped up in RTMT and the error above was logged in the trace.

Vladimir Savostin · ‎01-25-2010

Hello Vladimir,

From the SDL trace we can see that at 10:29:35 connectivity between node 1 and node 3 was restored, with the corresponding SdlLinkISV signals and further signal exchange.

Once connectivity was restored, the replication check process ran and found an information mismatch in the tables, resulting in replication status 3.

The facts above point to a network issue.

To prove or disprove it, you can capture sniffer traces from both nodes, either with the internal capture tool (utils network capture) or with an external PC (Wireshark), during the time you experience this issue.

Being able to ping or telnet to port 8002 does not necessarily mean there are no problems in the network.

In production it often happens that several routes exist between two hosts for redundancy, and not all of them can carry large packets.
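
One way to check whether a path carries large packets is to ping with the Don't Fragment bit set at several sizes. The sketch below only builds the commands, assuming an external Linux PC with the iputils ping (its `-M do` option sets DF); the hostname is a placeholder and the helper names are hypothetical.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def df_ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command with the Don't Fragment bit set."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]


# Placeholder hostname; replace with the remote node's address.
for mtu in (1500, 1400, 1300):
    print(mtu, " ".join(df_ping_command("subscriber2.example.com", mtu)))
```

If the largest size fails while smaller ones succeed, the active route has a smaller MTU than expected or drops large packets, even though small pings and a telnet to port 8002 look healthy.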

Vladimir Stankov · ‎01-26-2010

Hi Vladimir,

Thanks for the reply. In fact, yesterday we isolated the problem to a network issue.

In brief:

There are two tunnels between the central office and the remote site. The primary tunnel was down, so traffic was going through the secondary tunnel. Due to a wrong configuration on the routers in the central office, that traffic was traversing devices in other offices of the client, where the link capacity is low (2 Mb). I suspect that during peak hours the latency on the link went too high, which started to affect the cluster.
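
As a rough illustration of why a congested 2 Mb link can push latency up, here is a back-of-the-envelope sketch. It assumes the link runs at 2,000,000 bit/s and that packets are 1500 bytes; the queue depth of 20 packets is purely hypothetical.

```python
def serialization_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the link, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


def queue_delay_ms(queued_packets, packet_bytes, link_bps):
    """Extra wait behind a number of equal-sized packets already queued."""
    return queued_packets * serialization_delay_ms(packet_bytes, link_bps)


LINK_BPS = 2_000_000   # the 2 Mb link mentioned above (assumed 2,000,000 bit/s)
PACKET_BYTES = 1500    # assumed full-size packets
QUEUED = 20            # hypothetical queue depth during peak hours

print("per packet:", serialization_delay_ms(PACKET_BYTES, LINK_BPS), "ms")
print("behind queue:", queue_delay_ms(QUEUED, PACKET_BYTES, LINK_BPS), "ms")
```

Under these assumptions each full-size packet takes 6 ms to send, so a queue of 20 packets adds about 120 ms one way, well above the ~30 ms round trip measured on the healthy path.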

In fact the utils network connectivity command reported the following:

admin:utils network connectivity

This command can take up to 3 minutes to complete.
Continue (y/n)?y
Running test, please wait ...

Test failed: Could not receive TCP packets from publisher.
admin:

Also, ping with large packets (over 1500 bytes) reported loss on the link.

Anyway, after bringing the primary tunnel up, communication in the cluster was restored and DB replication is fine now.

Regards,

Vladimir