Cisco Support Community
Bronze

CUCM SERVICE RESTART ON CLUSTER

Hi everyone,

I'm having a cluster issue and it seems to be indirectly pointing to something with the Call Manager service.

This is the scenario:

1. I have multiple locations with Metro WAN connections of 4MB or more.

2. We have four CUCM 6.1.2 servers: 1 publisher and three subs (call processing subscribers). We have four major sites including head office, so I have three major device pools to do as much load balancing as possible, with phones registering to the sub at their local site.

The issue

If a WAN link goes down at any of the locations that have a Subscriber, everything works as expected. However, when the WAN link comes back into service, my PUBLISHER at head office starts dropping ALL phones that are registered to it. The phones go into a "Configuring IP" state and stay there.

My work around

For now the only way I have of fixing this is to restart the Call Manager service on the publisher. As soon as this is done, everything works fine: phones register and work.

Other information

RTT per site -approx 20 ms

DB replication - RTMT gives number of 2

thanks in advance

Accepted Solutions

Re: CUCM SERVICE RESTART ON CLUSTER

You should consider having your primary CCM in the HQ group be the subscriber and the Publisher be the secondary.  It does not appear you do this today but it is best practice to do so.  With the small number of phones you have, this wouldn't be a problem and you are running 7845's.  Ideally, your Publisher should be the database server.  Many times, it is the DB/TFTP and that's fine.  It's also OK for it to run CCM services but it should not be a primary call processing agent.  Quickly, I'd look at doing something such as this:

HQ CM Group

1. Subscriber

2. Publisher

Primary TFTP = Publisher for this site

For other sites, I would likely consider something like this:

Remote A CM Group

1. Local Subscriber

2. HQ Subscriber

3. Remote B Subscriber

Remote B CM Group

1. Local Subscriber

2. HQ Subscriber

3. Remote A Subscriber
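
For anyone who wants to sanity-check a layout like this before touching device pools, here is a minimal Python sketch. The node names (`Publisher`, `HQ-Subscriber`, and so on) are placeholders I made up, and the dictionary is just a hand-typed copy of the groups above, not something pulled from CUCM. All it does is flag any CM group whose first (primary) member is the publisher.

```python
# Hand-typed copy of the suggested CM groups; node names are hypothetical.
# The first entry in each list is the primary call processing server.
CM_GROUPS = {
    "HQ": ["HQ-Subscriber", "Publisher"],
    "Remote A": ["RemoteA-Subscriber", "HQ-Subscriber", "RemoteB-Subscriber"],
    "Remote B": ["RemoteB-Subscriber", "HQ-Subscriber", "RemoteA-Subscriber"],
}


def groups_with_publisher_primary(groups, publisher="Publisher"):
    """Return the names of CM groups whose primary member is the publisher."""
    return [
        name
        for name, members in groups.items()
        if members and members[0] == publisher
    ]


if __name__ == "__main__":
    offenders = groups_with_publisher_primary(CM_GROUPS)
    print("Groups with the publisher as primary:", offenders)
```

With the layout above the list comes back empty. Swapping the two HQ entries, which is how the original deployment was built, would flag the HQ group.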

However, QoS should be implemented - especially if the traffic profile is that there is a lot of intersite calling from HQ to Remote A and B.  800 phones in one location making calls to the other 2 sites with no prioritization on a 4 MB link could be asking for trouble at some point.  Your local TFTP setup is fine.  Different mgmt but shouldn't be an issue as long as you have DHCP Option 150 configured properly.
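
To put rough numbers on that 4 MB link concern, here is a back-of-the-envelope sketch in Python. It assumes G.711 with 20 ms packetization (160 bytes of audio, 50 packets per second), 40 bytes of IP/UDP/RTP headers, 18 bytes of Ethernet overhead, a 4 Mbps link, and a priority queue limited to about a third of the link. Apart from the link size, none of these figures come from the thread, so adjust them for the real codec and transport.

```python
def g711_call_kbps(l2_overhead_bytes=18, packets_per_second=50):
    """Bandwidth of one G.711 voice stream in one direction, in kbps."""
    payload_bytes = 160  # 20 ms of G.711 audio at 64 kbps
    header_bytes = 40    # IP (20) + UDP (8) + RTP (12)
    frame_bytes = payload_bytes + header_bytes + l2_overhead_bytes
    return frame_bytes * 8 * packets_per_second / 1000


def max_priority_calls(link_kbps, priority_share, per_call_kbps):
    """Whole calls that fit in the share of the link reserved for voice."""
    return int((link_kbps * priority_share) // per_call_kbps)


if __name__ == "__main__":
    per_call = g711_call_kbps()
    print(f"per call: {per_call:.1f} kbps")
    # Assumed: 4 Mbps link with 33% of it set aside for voice.
    print("calls in priority queue:", max_priority_calls(4000, 0.33, per_call))
```

Under those assumptions each call costs about 87 kbps and roughly 15 simultaneous calls fit in the voice share of the link. Signalling, replication, and TFTP traffic all compete for what is left, which is why prioritization matters with this many phones.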

Regarding Unity at Remote site - is that used for all the company users?  If so, that factors into traffic profile especially for HQ to the remote site where Unity is.  The SDL Link OOS in logs indicates that the underlying replication link to a node or nodes was lost - i.e., Publisher cannot communicate with Sub1 or whatever the case may be.  Phones on that server would then have to failover to their secondary.

FYI: I also once ran into an issue during an upgrade where phones had trouble registering when they were associated with existing device pools that contained default/first configured CM groups.  In that case, I had to create new CM groups and update the DP associations, and all registration issues went away.

HTH

Hailey

24 REPLIES
Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Hi,

Can I just clarify one thing?

For the phones that go to "Configuring IP", is their primary CallManager the Pub or the Sub that has just become available?

cheers

Mark

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Hi Mark,

Thanks for responding. Strangely enough, the same thing is happening again this morning. I've opened a TAC case.

To answer your question, all phones have a Pub and two Subs that are available. The ones that are saying "Configuring IP" have two subs, one that is in their same building and one that is at a standby site.

I've restarted the Call Manager service again just now and some phones that were saying "Configuring IP" came up, but some still have not. I've just restarted the Publisher completely and the phones are saying "Registering". I've checked the phone settings:

     1. Under Network Configuration - TFTP Server 1 = Pub, TFTP Server 2 = Sub

     2. Under Device - Call Manager Configuration - active Call Manager = Pub, second = Sub 1, third = Sub 2, fourth = local gateway.

I'm trying to understand this myself, to be honest, so I'm hoping that my explanations are coming across properly.

Re: CUCM SERVICE RESTART ON CLUSTER

I'd like to understand your design/deployment model a bit better.  You have a Pub and I believe the original post stated 3 subs.  However, it appears that the primary CCM in the CM groups is the Publisher.  I would recommend against this approach.  Typically, it is best to keep the Pub to DB/HTTP services as much as possible - it may run CCM for backup purposes and possibly be a TFTP primary.  Would you mind providing more detail on your CM group configurations and even (if you can) a generic diagram of the architecture?  No need to share anything confidential.

Hailey

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Hi,

Have a look at the attached diagram I quickly drew up. I have not stopped any services on any of the systems and all are on different networks. The history of this is as follows:

Pub/sub set up about two years ago - no problem

Branch A sub set up last year - the Call Manager service issue first occurred in early January of this year - once only

Branch B set up last month - the issue has been occurring as of the date of this post and is occurring again today. Normally restarting the Call Manager service will fix it; however, today some phones register and then de-register.

Symptoms occurring today

1. Phones deregister although there is a backup

2. Phones that are registered remain registered for some time and then drop off

3. Phones that register give no dial tone for some time and then drop back off

I've just gotten a report that the phones are finally registered; however, they are experiencing one-way audio. This exact thing has happened previously and I'm about to restart the Call Manager service once more. (It sounds like routing or network, I know, but there are too many other variables to pick the network as the starting point for troubleshooting at this point in time.)

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

As said, I just restarted the Call Manager service one more time and everything is back up. Any assistance with this would be greatly appreciated, of course.

Just went back down

:TIME 206 - service restarted again about half hour ago and things are back up

Re: CUCM SERVICE RESTART ON CLUSTER

I know it's not fun to have this issue, but it is an interesting one:

1.  Can you tell me which server is Head Office Primary (is this Pub or Sub) as that will tell me what the secondary is for the Head Office.

2.  What model servers are you running at each location?

3.  TFTP -you may want to consider splitting TFTP across sites as well (this is allowed) or consider trusted relay point for the remote sites if they need to pull configuration.  This will reduce BW on the WAN.

It does sound as if you have some network-related issues at play here as well, such as a routing loop of sorts where, when a link flaps and comes back online, you experience issues at the HQ site...but that is mere speculation.  The description of the problem is unique for sure.  Do you have QoS configured throughout the network?

Out of curiosity, what role has TAC played thus far?  In other words, what traces and such have they asked to see?

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

It's always a learning process, I guess, but I'd rather have this one not happen.

To answer the questions:

1. Server Pub is the head office primary server. (Note: all we did to the cluster was install each server onto its own network and ensure replication occurred. We did some failover testing as well, but we did not stop any services. Also, RTT and bandwidth were adequately catered for, IMO.)

2. MCS-7845 servers with CUCM 6.1.2.1106-1

3. That is exactly how we have it set up. The phone groups within the three sites are each set up to access the local TFTP server. All servers are running their own TFTP services (I was concerned that this is what was causing it, to be honest), so Branch A points to the Branch A TFTP, etc.

I have been suspecting the same thing for a while now, to be honest, but it's only the phone system that seems to be affected (mind you, this is 16-plus sites). TAC has sent me my second email and I'm trying to explain the same things to him as well. No traces as yet. I've done multiple restarts and I'm done with the restarts for now. I've checked RTMT and I'm seeing SDLLinkOutOfService every now and again.

Are there any services I should have stopped to ensure that we follow the proper guidelines?

PS. QoS is not implemented. (I know it's a big deal, but it is something that we have strongly recommended to them, and since it's a government entity it's taking too long. To be perfectly honest, a report needs to come out of this which may stress the QoS factor as well.)

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Another update that I sent to TAC a few minutes ago:

I've just confirmed:

Branch A and B do not go down. It's only phones registered to the Pub itself. If the Branch A or B servers are rebooted (which we did), their phones register or attempt to register to the Pub, and it seems as though they fall into the same situation when they try to do that. (I cannot confirm this as it is a user report and I am not on site.)

We have restarted the Call Manager service again and most sites are up. Phones that usually register with the Pub are now registering with the Sub, with the exception of CTI route points, which are registered and working with the Pub.

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Update # 4 million...

Calls could not be made out the local H.323 gateway off the Pub. Getting fast busy.

Workaround: Restart the Call Manager service.

Comments: I'm thinking of changing the device pool to register with the Sub instead of the Pub now.

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

I've already reassociated the Sub as the primary. I'll need to do a reset later this evening and see if that fixes the issue. Also, the Unity is only local for that branch; no one else uses it. I have it configured to register with its branch Subscriber and then, if that fails, register with the others.

I'm addressing the other groups as well and will definitely keep you informed.

Two questions, if you don't mind:

1. Shall I stop the TFTP service on the Sub at the main location? Are there any other services you would recommend stopping on any of the servers?

2. Do you have any sample router QoS configuration I can follow?

I really do appreciate the assistance here.

PS. I'll be buying you a truckload of beers for this one.

Hall of Fame Super Red

Re: CUCM SERVICE RESTART ON CLUSTER

Hi Arvind,

Long time my friend!

Just to add a link to the great help from Hailey (+5 for you sir!)



I thought that you might find this useful (Table 16);

Cluster Service Activation Node Recommendations

The "Configuring Services" chapter in the Cisco Unified Serviceability Administration Guide does not include the following information that describes service activation recommendations for specific nodes in a cluster. Table 16 provides a general summary of the cluster activation recommendations for a feature service in these nodes: publisher, subscriber, TFTP, and MOH. For specific recommendations that are associated with activating a particular feature service, refer to the Cluster Service Activation Recommendations section in the "Configuring Services" chapter.


http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/rel_notes/7_1_3/cucm-rel_notes-713.html#wp1791070

Cheers!

Rob

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Rob,

Great to hear from you as always, and I hope things are better on your side than mine right now :). I think I'm getting too old for this.

Thanks for the info; I'll be doing some extra reading up this evening.

I'll keep you guys informed.

Any QoS assistance on the WAN side would be beerly appreciated......:)

Hall of Fame Super Red

Re: CUCM SERVICE RESTART ON CLUSTER

Hey Arvind,

It's like the old Dylan song "Everything is broken", working with TAC on a CUC Publisher "out of service".

Cheers and good luck!

Rob

Re: CUCM SERVICE RESTART ON CLUSTER

Rob,

The man, the myth, the legend.  Thanks for the rating.  It is always appreciated.

Arvind,

I hope to have been of help to you today.  I have some other thoughts and will post them; however, I need to make this post quick due to some work I have to get moving along on.

Rob's link is a great info source.  Cisco has recommended best practices on which service should/should not run on each node in the cluster based on role.  In your current configuration, I think you will find that once the dust settles - letting your Pub be a pub and your Subs be subs will lead to much better performance and stability within the CUCM system.  I would also start thinking about an upgrade to 7x.  Given where you're at, you may need to do multi-step upgrade (don't recall exact version you listed) but there were still a lot of bugs in the 6x code.  If you are going to stay on 6x for whatever reason for some time, it would be prudent to consider at least getting to 6.1(3) as we have found that train of code has been very stable (bug-wise) for a number of much larger and even small customers.

If you have some time and just want to learn some things, check out this blog that my colleague wrote regarding TFTP and firmware distribution options.  It's quite informative and gives you some things you might want to think about if you are ever in a situation where your WAN sites start to grow to half the size of, or equal to, your HQ site.

http://www.netcraftsmen.net/resources/blogs/options-for-distributing-cisco-phone-firmware.html?blogger=William+Bell

Hailey

Re: CUCM SERVICE RESTART ON CLUSTER

OK, your big topic - QoS.  I'm going to give you an honest answer based on what I am assuming may be applicable to your environment and what I know of its topology (albeit in a limited fashion).  If you have an all-Cisco environment AND you or possibly your customer (not sure who is managing the network for you here) has constraints in time, resources, or experience - whatever the case may be - I would strongly suggest that you take a gander at what Auto QoS can do for you.  It's not for everyone or every deployment but it might be just enough for you.  Deploying QoS is a complex topic.  There are many scenarios and facets to understand, and each Cisco platform can have different capabilities or limitations in how QoS is implemented.  In any case, Auto QoS is better than no QoS, and don't let anyone tell you otherwise unless you have unlimited bandwidth all over the place (which you don't).  So, here are some really good links to get you started:


http://www.cisco.com/en/US/technologies/tk543/tk879/technologies_qas0900aecd8020a589.html


http://www.cisco.com/en/US/tech/tk543/tk759/technologies_white_paper09186a00801348bc.shtml


http://articles.techrepublic.com.com/5100-10878_11-6134065.html


HTH and good luck to you.


BTW, does Rob get the truckload of beer or me?  If nothing else, can we at least share?

Hall of Fame Super Red

Re: CUCM SERVICE RESTART ON CLUSTER

Hailey,

You deserve all the beer, my friend! Your participation here and the quality of your answers are fantastic! I could really use a couple of "cool ones" right now, but I'll have to wait. Maybe on the weekend we can all have a virtual pint together.

Cheers!

Huff

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Hailey,

A million thanks to you for assisting with this. I've posted a few before with no responses, so I started to get a little worried about my NetPro reliability factor. QoS is always a little grey for me, to be honest, and I normally do Auto QoS, which I'll do now as well. I'll do some reading and come up with an optimum policy as well.

I've changed all the groups accordingly, and have checked that all phones are registering to their respective subscriber first.

Rob, as always, thanks for making the dilemma a little easier on the brain; I was starting to panic a little there. I have not stopped any services as yet, as I want to take it one step at a time. I'll give it a couple of days and then proceed to stop some services.

As for the recommendations, I'll definitely submit those. I'm not sure why we upgraded to an engineering special, but checking through some of the docs, there was some issue with the backup not displaying properly. Anyhow, QoS has been in discussion with this customer for the past few months, so I'm assuming this should push it along as well.

I'll monitor during the course of this week, and if you guys don't hear from me on this one, then I'd say we should be OK.

Again, thanks, and beers are for both of you guys, although I drank about 24 already trying to solve this issue.

PS. I still think the link instability may be the root cause of this, and I'll schedule a test to attempt to recreate the issue. The test may be:

1. Remove the QoS that I'm about to put on.

2. Place the main group back to register with the Pub.

3. Pull the WAN link for one of the branches and put it back in after a second.

4. See what happens. If I'm correct, the issue will occur.

5. Place the QoS back in the mix and see what happens. I'm thinking the same thing may occur, but only the test will tell.

If the link is the issue, then I'll need to place some type of delay on the router links so they remain down instead of flapping.
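
To make the "stay down instead of flapping" idea concrete, here is a small Python model of a hold-down timer. It is only a simulation of the behaviour, not router configuration, and the function name, the event format, and the 30-second value are all my own assumptions. The link is treated as down the moment it drops, and as up again only after it has stayed up for the full hold time.

```python
def damped_transitions(events, hold_up_seconds):
    """Apply a hold-down timer to raw link events.

    events: list of (time_seconds, is_up) raw state changes in time order.
    The link is assumed to be up before the first event and to stay in its
    last reported state after the final event.
    Returns the (time_seconds, is_up) transitions the rest of the network
    would actually see.
    """
    effective_up = True
    pending_up_since = None  # time the raw link last came up, if still waiting
    seen = []
    for t, is_up in events:
        # A pending "up" that has survived the hold time becomes effective.
        if pending_up_since is not None and t - pending_up_since >= hold_up_seconds:
            seen.append((pending_up_since + hold_up_seconds, True))
            effective_up = True
            pending_up_since = None
        if not is_up:
            pending_up_since = None  # any drop cancels the pending recovery
            if effective_up:
                seen.append((t, False))
                effective_up = False
        elif not effective_up and pending_up_since is None:
            pending_up_since = t
    if pending_up_since is not None:
        seen.append((pending_up_since + hold_up_seconds, True))
    return seen


if __name__ == "__main__":
    # Three quick flaps, one second apart, like pulling and replugging the WAN link.
    flaps = [(0, False), (1, True), (2, False), (3, True), (4, False), (5, True)]
    print(damped_transitions(flaps, hold_up_seconds=30))
```

With a 30-second hold the three flaps collapse into a single outage that ends at t=35, so the cluster would see one clean down/up instead of a burst of link changes. With a hold of 0 every flap passes straight through.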

Re: CUCM SERVICE RESTART ON CLUSTER

Rob - as the zen master, your compliments are greatly appreciated and well received.

A - you are welcome, my friend.  Don't give up on the forums yet.  I use them all the time; you just have to sift through some garbage to find the gems.  Some colleagues and I are trying to be more active because we see a lot of things we could help with, but it's a balancing act - we're a small, very busy company - and our UC practice is booming.  So, we try to make time to both keep a great reputation with our clients and also participate in the forums (mostly on our own time and dime, which is cool).  As for the root cause of your issue, some of the details are muddled at this point, but I seem to remember you saying that things had been isolated to an issue at the HQ site alone.  My assumption was this was independent of any WAN outages, but I may have overlooked that fact.  In any case, let me finish the night with this final thought:

If you believe that you have a routing issue and/or general instability somewhere in your network - make sure that you sniff that out and resolve it.  It should be a priority to do so.  With or without QoS, if this is happening frequently - you're going to have problems somewhere.  QoS may or may not help the aesthetics of the problem (i.e., phone registration issues).  QoS really would, IMO, likely have little effect on the problem OR the effect may simply be to mask the underlying issue which is not what you want to do.

From there, get your Auto QoS in place and as I said - it's better than no QoS and in certain cases, it's the best choice from a number of angles and primarily supportability and upkeep being the top 2 for most.  It sounds like it's the right thing to do given the scenario.

As for CUCM, stick with my recommendations which aren't mine - they're just best practice.  Pub is a pub.  It can be a backup, as needed.  But 99% of the time, it should focus on CUCM subsystem processes (DB, web administration, etc).  Remote sites can have redundancy via local and remote subscribers - if you go that route, you have most of your bases covered for "what happens if this server goes down".  From there, read up on Rob's link and you'll easily be able to determine what services should run on which hosts based on best practices.

If you're interested, take some time to look thru some of these blogs as well.  There's some good info being posted here all the time.  http://www.netcraftsmen.net/resources/blogs/tags/70/

Good luck.

Hailey

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Hailey,

Thanks, and sorry, I'm beat as hell, so excuse my inconsistent postings. I'm following the recommendations to a "T" and I'll take this one step at a time. One day you're thinking, hmmm, I'm getting good at this IPT thing; the next, it's like whatttttt, did I learn anything in the past few years? Hehe, good thing I laugh a lot, otherwise I think my head would have been in this computer screen....literally.......lol

Anyway, guys, thanks again. I'll be trying to resume my duties on this forum as well, to help those that didn't read the manual....oh wait...I meant had problems that were unforeseen.

Good night, guys.

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

An update,

Things seem to be getting worse since the change was made. My gateways are dropping calls and Branch B voicemail (which they use for Auto Attendant) isn't working.

The phones are NOT dropping but other services are being affected.

Bronze

Re: CUCM SERVICE RESTART ON CLUSTER

Guys,

Just a final update for anyone who might encounter this type of issue.

First off, I've been smoke-free for a year and a half now, been through a motorcycle accident and that didn't even give me a little craving, and as of this morning, I've hit up five cigs.....lol...I know, I'm disappointed as well..no worries, I'm back smoke-free tomorrow. Needless to say, this was too stressful, and hopefully I can do a final post on this issue that may give someone in this situation a little less stress.

Hailey and Rob, again, thanks.

Here is the rundown. Note that this was done working with my TAC engineer, so I'd probably not recommend you run the database replication commands I mention on your own if you're like me (limited knowledge of the steps and procedures for database rebuilds).

Environment

Pub and Sub at head office

Sub at site A - 4MB metro link

Sub at site B - 4MB metro link

Symptom

1. Phones de-register by themselves, come back up, then de-register again - started once, then was fine for a month, then every day, then every hour

2. Gateways (H.323 and MGCP) not able to make calls or going out intermittently

3. Incoming calls not able to complete unless the gateway is reset, and then calls stop completing again after a while

4. Phones registered to SITE X can call between themselves but cannot call any other phones registered to any other Sub or Pub locally to themselves.

5. Resetting any component (meaning gateway, phone, route list--I mean anything) resets all phones and gateways throughout, including those on other subscribers.

6. Overall weird issues with clustering over the WAN

PROBLEM

Database replication problem at one of the Subs. We determined this by using RTMT and the CLI and checking routing statistics. RTMT showed status 2 on the replication, which seemed to be fine; however, the database seemed to be corrupted in any event, and after the Cisco engineer and I did some digging, it was noted that this Sub (Sub B) was the issue. Strangely enough, it then started to show up in RTMT.
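
For reference, this is how I understand the replication state numbers that RTMT reports, written as a small Python lookup. The descriptions are paraphrased from memory of Cisco's documentation, so treat them as an assumption and check them against the docs for your release. The node names in the example are placeholders.

```python
# Replication state values as I understand them (assumed, paraphrased).
REPLICATION_STATES = {
    0: "initializing",
    1: "replicate count incorrect",
    2: "good",
    3: "mismatched tables / bad replication",
    4: "setup failed or dropped",
}


def describe_replication_state(state):
    """Translate a numeric replication state into a short description."""
    return REPLICATION_STATES.get(state, "unknown state")


def needs_attention(states_by_node):
    """Return the nodes whose reported state is anything other than 2."""
    return sorted(node for node, state in states_by_node.items() if state != 2)


if __name__ == "__main__":
    reported = {"Pub": 2, "SubHQ": 2, "SubA": 2, "SubB": 2}
    for node, state in reported.items():
        print(node, state, describe_replication_state(state))
    print("Needs attention:", needs_attention(reported))
```

Note that with every node reporting 2, as in this thread, the check comes back empty even though Sub B turned out to be the problem. The state number is a useful first look but it was not enough on its own here.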

CAUSE (assumption)

Provider Metro link flapping was the initial issue. Speaking to the provider, the issue is network related, and there was some packet loss that we had not noticed. The provider had the WAN link on a separate VLAN on their side which had a limit on the number of MAC addresses. Since REPLICATION for CUCM 6.X is mesh, it seemed as though the link was becoming unstable and the MACs were either being discarded or something to that effect. I'm not sure of the exact cause, of course, and this is speculation at this point in time.

The provider increased the number of MACs allowed on their equipment and the link came up perfectly with no loss.

We then did a database replication stop on all subscribers first, then on the Publisher.  We then issued a database cluster reset on the Publisher, then verified everything was fine. We did a utils service list to ensure all services had started.

We then rebooted each server in this order - Pub, then subs - waiting 15 minutes between each (just to be safe).
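
The ordering matters in that procedure, so here it is written out as a tiny Python planner. This is only my reading of the steps described above, with placeholder node names. Apart from `utils service list`, the step descriptions are plain English and not exact CLI syntax, and the 15-minute gap is simply what was used here. As noted earlier, this kind of rebuild is best done with TAC on the line.

```python
def recovery_plan(publisher, subscribers, reboot_gap_minutes=15):
    """Return the recovery steps described in the post, in order."""
    # Replication is stopped on every subscriber before the publisher.
    steps = [f"{node}: stop database replication" for node in subscribers]
    steps.append(f"{publisher}: stop database replication")
    # The cluster reset is issued from the publisher only.
    steps.append(f"{publisher}: database cluster reset")
    steps.append("verify replication and run 'utils service list'")
    # Reboot the publisher first, then each subscriber after a pause.
    steps.append(f"{publisher}: reboot")
    for node in subscribers:
        steps.append(f"wait {reboot_gap_minutes} minutes")
        steps.append(f"{node}: reboot")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(recovery_plan("Pub", ["SubHQ", "SubA", "SubB"]), 1):
        print(f"{number}. {step}")
```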

Everything is working as before and we will be monitoring.

thanks again. (Hopefully this is my last post on this subject)

I'm now going to see if we can do any configuration on the router to fail the link in the event of flapping.

Re: CUCM SERVICE RESTART ON CLUSTER

Very interesting.  If your provider was having link stability issues along with enforcing MAC-based security policies, DB replication problems are not surprising.  I can actually see how the actual "bad" server may think it is OK (from a DB perspective) - it doesn't necessarily know that it missed an update; however, I would think that the Publisher would show signs of replication issues, whether it be the replicate state via RTMT for each server or even replication count inconsistencies.  However, I once ran into a similar type of issue in a very large cluster with clustering over the MAN.  The symptoms were just much more benign than what you've experienced.  It was a 4x implementation, though, and it was easily explained why the DB replication issue was "hidden" given how replication actually takes place in 4x.

Nonetheless, I hope you get this ironed out so you can step away from it and de-stress.

Hall of Fame Super Red

Re: CUCM SERVICE RESTART ON CLUSTER

Hey Arvind,

Very good to hear that this complex issue is now resolved. Thanks so much for taking the time to update this thread even though you are probably sick of thinking about it. +5 points for this very kind gesture, my friend! The one thing that this type of issue really highlights is the inter-dependencies between the actual network and what is riding upon it. I think we all knew that this is where the underlying issue was, but how to pinpoint it is another thing altogether. Great work solving this! Now, hopefully, you can get back to your "normal" life.... if there is such a thing when you are working with this stuff.

Hailey, thanks again for your fine participation on these great Forums!

Cheers!

Huff
