CUCM SERVICE RESTART ON CLUSTER

a.gooding
Level 5

Hi everyone,

I'm having a cluster issue, and it seems to point indirectly to something with the Call Manager service.

This is the scenario:

1. I have multiple locations with Metro WAN connections of 4MB or more.

2. We have four CUCM 6.1.2 servers: one Publisher and three Subs (call processing Subscribers). We have four major sites including head office, so I have three major device pools to do as much load balancing as possible, with phones registering to the Sub at their local site.

The issue

If a WAN link goes down at any of the locations that have a Subscriber, everything works as expected. However, when the WAN link comes back into service, my PUBLISHER at head office starts dropping ALL phones that are registered to it. The phones go into a "Configuring IP" state and stay there.

My workaround

For now, the only way I have of fixing this is to restart the Call Manager service on the Publisher. As soon as this is done, everything works fine: phones register and work.

Other information

RTT per site - approx. 20 ms

DB replication - RTMT gives a value of 2

Thanks in advance

1 Accepted Solution

You should consider having your primary CCM in the HQ group be the subscriber and the Publisher be the secondary.  It does not appear you do this today but it is best practice to do so.  With the small number of phones you have, this wouldn't be a problem and you are running 7845's.  Ideally, your Publisher should be the database server.  Many times, it is the DB/TFTP and that's fine.  It's also OK for it to run CCM services but it should not be a primary call processing agent.  Quickly, I'd look at doing something such as this:

HQ CM Group

1. Subscriber

2. Publisher

Primary TFTP = Publisher for this site

For other sites, I would likely consider something like this:

Remote A CM Group

1. Local Subscriber

2. HQ Subscriber

3. Remote B Subscriber

Remote B CM Group

1. Local Subscriber

2. HQ Subscriber

3. Remote A Subscriber
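
As a rough illustration of the layout above, the CM groups can be written down as ordered lists and checked so that the Publisher never sits in the primary slot. This is only a sketch: the node names (`hq-sub`, `publisher`, and so on) and the helper function are hypothetical and are not part of any CUCM API.

```python
# Hypothetical model of the CM groups listed above.
# Each list is in priority order; node names are illustrative only.
CM_GROUPS = {
    "HQ": ["hq-sub", "publisher"],
    "Remote A": ["remote-a-sub", "hq-sub", "remote-b-sub"],
    "Remote B": ["remote-b-sub", "hq-sub", "remote-a-sub"],
}


def groups_with_publisher_primary(groups, publisher="publisher"):
    """Return the names of groups whose first (primary) member is the publisher."""
    return [
        name
        for name, members in groups.items()
        if members and members[0] == publisher
    ]


# With the layout above, no group uses the publisher as primary.
print(groups_with_publisher_primary(CM_GROUPS))
```

Running the same check against the current design (Publisher first in the HQ group) would flag the HQ group.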

However, QoS should be implemented - especially if the traffic profile is that there is a lot of intersite calling from HQ to Remote A and B.  800 phones in one location making calls to the other 2 sites with no prioritization on a 4 MB link could be asking for trouble at some point.  Your local TFTP setup is fine.  Different mgmt but shouldn't be an issue as long as you have DHCP Option 150 configured properly.
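
To put the bandwidth concern in numbers, here is a back-of-the-envelope sketch. It assumes "4 MB" means a 4 Mbit/s link, G.711 with 20 ms packetization, and it ignores Layer 2 overhead unless you pass it in; the function names and the 50% voice share are illustrative assumptions, not values from this thread.

```python
def g711_call_kbps(packet_ms=20, l2_overhead_bytes=0):
    """Approximate one-way bandwidth of a single G.711 call in kbit/s."""
    payload_bytes = 64_000 * packet_ms // 1000 // 8  # 160 bytes at 20 ms
    header_bytes = 20 + 8 + 12                       # IP + UDP + RTP
    packets_per_second = 1000 / packet_ms
    total_bytes = payload_bytes + header_bytes + l2_overhead_bytes
    return total_bytes * 8 * packets_per_second / 1000


def max_calls(link_kbps, voice_share=1.0, **call_kwargs):
    """Number of simultaneous calls that fit in the given share of the link."""
    return int(link_kbps * voice_share // g711_call_kbps(**call_kwargs))


print(g711_call_kbps())                  # 80.0 kbit/s per call at Layer 3
print(max_calls(4000))                   # 50 calls if voice had the whole link
print(max_calls(4000, voice_share=0.5))  # 25 calls if voice gets half of it
```

Even with the whole link to itself, a few dozen intersite calls would fill it, and without prioritization voice competes with everything else on the WAN.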

Regarding Unity at Remote site - is that used for all the company users?  If so, that factors into traffic profile especially for HQ to the remote site where Unity is.  The SDL Link OOS in logs indicates that the underlying replication link to a node or nodes was lost - i.e., Publisher cannot communicate with Sub1 or whatever the case may be.  Phones on that server would then have to failover to their secondary.
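
The failover behaviour described above can be pictured with a small sketch: a phone ends up on the highest-priority node of its CM group that it can still reach. This is a simplified model under assumed node names, not how the phone firmware is actually implemented.

```python
def registration_target(cm_group, reachable):
    """Return the highest-priority node in cm_group that is reachable, else None."""
    for node in cm_group:
        if node in reachable:
            return node
    return None  # nothing reachable; a local gateway fallback would apply here


remote_a_group = ["remote-a-sub", "hq-sub", "remote-b-sub"]

# WAN link down at Remote A: only the local subscriber is reachable.
print(registration_target(remote_a_group, {"remote-a-sub"}))

# Local subscriber down, WAN up: phones fall back to the HQ subscriber.
print(registration_target(remote_a_group, {"hq-sub", "remote-b-sub"}))
```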

FYI: I also once ran into an issue during an upgrade where phones had trouble registering when they were associated with existing device pools that contained default/first-configured CM groups.  In that case, I had to create new CM groups and update the DP associations, and all registration issues went away.

HTH

Hailey

24 Replies

markbatts
Level 1

Hi,

Can I just clarify one thing?

For the phones that go to "Configuring IP", is their primary Call Manager the Pub or the Sub that has just become available?

cheers

Mark

Hi Mark,

Thanks for responding. Strangely enough, the same thing is happening again this morning. I've opened a TAC case.

To answer your question, all phones have a Pub and two Subs available. The ones that are saying "Configuring IP" have two Subs, one in their own building and one at a standby site.

I've restarted the Call Manager service again just now, and some phones that were saying "Configuring IP" came up, but some still have not. I've just restarted the Publisher completely and the phones are saying "Registering". I've checked the phone settings:

     1. Under Network Configuration - TFTP Server 1 = Pub, TFTP Server 2 = Sub

     2. Under Device - Call Manager Configuration - active Call Manager = Pub, second = Sub 1, third = Sub 2, fourth = Local Gateway.

I'm trying to understand this myself, to be honest, so I'm hoping that my explanations are coming across properly.

I'd like to understand your design/deployment model a bit better.  You have a Pub, and I believe the original post stated 3 Subs.  However, it appears that the primary CCM in the CM groups is the Publisher.  I would recommend against this approach.  Typically, it is best to keep the Pub to DB/HTTP services as much as possible - it may run CCM for backup purposes and possibly be a TFTP primary.  Would you mind providing more detail on your CM group configurations and even (if you can) a generic diagram of the architecture?  No need to share anything confidential.

Hailey

Hi,

Have a look at the attached diagram I quickly drew up. I have not stopped any services on any of the systems, and all are on different networks. The history is as follows:

Pub/Sub set up about two years ago - no problem

Branch A Sub set up last year - the Call Manager service issue started to occur in early January of this year - once only

Branch B set up last month - issue occurring as of the date of this post and occurring again today. Normally restarting the Call Manager service fixes it; however, today some phones register and then deregister.

Symptoms occurring today

1. Phones deregister although there is a backup

2. Phones that are registered remain registered for some time and then drop off

3. Phones that register give no dial tone for some time and then drop back off

I've just gotten a report that the phones are finally registered; however, they are experiencing one-way audio. This exact thing has happened previously, and I'm about to restart the Call Manager service once more. (It sounds like routing or network, I know, but there are too many other variables to pick the network as the starting point for troubleshooting at this point in time.)

As said, I just restarted the Call Manager service one more time and everything is back up. Any assistance with this would be greatly appreciated, of course.

Just went back down

TIME 206 - Service restarted again about half an hour ago and things are back up.

I know it's not fun to have this issue, but it is an interesting one:

1.  Can you tell me which server is the Head Office primary (is this the Pub or a Sub)?  That will tell me what the secondary is for the Head Office.

2.  What model servers are you running at each location?

3.  TFTP - you may want to consider splitting TFTP across sites as well (this is allowed) or consider a trusted relay point for the remote sites if they need to pull configuration.  This will reduce BW on the WAN.

It does sound as if you have some network-related issues at play here as well, such as a routing loop of sorts where, when a link flaps and comes back online, you experience issues at the HQ site...but that is mere speculation.  The description of the problem is unique for sure.  Do you have QoS configured throughout the network?

Out of curiosity, what role has TAC played thus far?  In other words, what traces and such have they asked to see?

It's always a learning process I guess, but I'd rather this one didn't happen.

To answer the questions:

1. Server Pub is the head office primary server. (Note: all we did to the cluster was install each server onto its network and ensure replication occurred. We did some failover testing as well, but we did not stop any services. Also, RTT and bandwidth were adequately catered for, IMO.)

2. MCS-7845 servers with CUCM 6.1.2.1106-1

3. That is exactly how we have it set up. Each phone group within the three sites is set up to access the local TFTP server. All servers are running their own TFTP services (I was concerned that this is what was causing it, to be honest), so Branch A points to the Branch A TFTP, etc.

I have been suspecting the same thing for a while now, to be honest, but it's only the phone system that seems to be affected (mind you, this is 16-plus sites). TAC has sent me a second email and I'm trying to explain the same things to them as well. No traces as yet. I've done multiple restarts and I'm done with the restarts for now. I've checked RTMT and I'm seeing SDLLinkOutOfService every now and again.

Are there any services I should have stopped to ensure that we follow the proper guidelines?

PS. QoS is not implemented. (I know it's a big deal, but it is something that we have strongly recommended to them, and since it's a government entity it's taking too long. To be perfectly honest, a report needs to come out of this which may stress the QoS factor as well.)

Another update that I sent to TAC a few minutes ago:

I've just confirmed:

Branch A and B do not go down. It's only phones registered to the Pub itself. If the Branch A or B servers are rebooted (which we did), they register, or attempt to register, to the Pub, and it seems as though they fall into the same situation when they try to do that. (I cannot confirm this, as it is a user report and I am not on site.)

We have restarted the Call Manager service again and most sites are up. Phones that usually register with the Pub are now registering with the Sub, with the exception of CTI route points, which are registered and working with the Pub.

Update # 4 million...

Calls could not be made out of the local H.323 gateway off the Pub. Getting fast busy.

Workaround: Restart the Call Manager service.

Comments: I'm thinking of changing the Device Pool to register with the Sub instead of the Pub now.

I've already reassociated the Sub as the primary. I'll need to do a reset later this evening and see if that fixes the issue. Also, the Unity is only local for that branch; no one else uses it. I have it configured to register with its branch Subscriber and then, if that fails, register with the others.

I'm addressing the other groups as well and will definitely keep you informed.

Two questions, if you don't mind:

1. Shall I stop the TFTP service on the Sub at the main location? Are there any other recommended services to stop on any of the servers?

2. Do you have any sample router QoS configuration I can follow?

I really do appreciate the assistance here.

PS. I'll be buying you a truckload of beers for this one.

Hi Arvind,

Long time my friend!

Just to add a link to the great help from Hailey (+5 for you sir!)



I thought that you might find this useful (Table 16);

Cluster Service Activation Node Recommendations

The "Configuring Services" chapter in the Cisco Unified Serviceability Administration Guide does not include the following information that describes service activation recommendations for specific nodes in a cluster. Table 16 provides a general summary of the cluster activation recommendations for a feature service in these nodes: publisher, subscriber, TFTP, and MOH. For specific recommendations that are associated with activating a particular feature service, refer to the Cluster Service Activation Recommendations section in the "Configuring Services" chapter.


http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/rel_notes/7_1_3/cucm-rel_notes-713.html#wp1791070

Cheers!

Rob

Rob,

Great to hear from you as always, and I hope things are better on your side than mine right now :). I think I'm getting too old for this.

Thanks for the info; I'll be doing some extra reading this evening.

I'll keep you guys informed.

Any QoS assistance on the WAN side would be beerly appreciated......:)

Hey Arvind,

It's like the old Dylan song "Everything Is Broken", working with TAC on a CUC Publisher "out of service".

Cheers and good luck!

Rob
