Can someone recommend a good starting point for troubleshooting slow failover of phones in a CM group? In our current environment it takes around a minute before phones start rehoming to the secondary CM.
How is the primary CM becoming unavailable? Is it being unplugged from the network, or is CallManager stopping, or what?
Also, how many phones are registered to the primary when it becomes unavailable and they need to fail over to the secondary?
The last time, it was the installation of a patch that caused the reboot. Approximately 50 phones.
We have also had the same failover issues when the primary has been accidentally unplugged.
If the primary CallManager that has phones actively registered to it is unplugged, it could take a bit longer for these phones to fail over to their secondary CallManager.
Firstly, if the phones have the backup CM listed as "Standby" in their Network Config screens (this would be option 22) then there are 3 ways that the failover can happen:
1) CallManager service is stopped on Active CM. Phones immediately register to their Standby Backup CM.
2) Active CM becomes unavailable either due to network or server problems. A user starts pressing buttons, which generates Skinny messages over the active socket. When there is no response, the phone registers to Standby Backup CM.
3) Active CM becomes unavailable either due to network or server problems. Phones where the users are not actively trying to use the phone will register to Standby Backup CM when keepalive timers expire with Active Primary CM.
Now that being said, each of the examples above assumes that the backup CM is in a "Standby" state on your devices. Do you know if that was the case? If not, then it would definitely take at least some more time.
Also, you may be running into an issue where if the phone loses connectivity to a CM for more than 15 minutes while still being active with another CM, it won't reopen its connection with the CM that was down. The defect is CSCdv03757 and this may be impacting your situation, but I am not positive of it. The fix for that is in phone loads for 3.1(2a), which is already on CCO, and also will be in 3.0(12), which should be available within a couple of weeks or so.
The only way to really tell what happened is to first see the status of the phone with the primary and backup CM, and then get a sniffer trace on the phone, starting before anything happens to either CM and running through the whole event.
All phones have the backup call manager listed as "standby".
How long should a switchover take? How long should a switchback take?
If the CallManager service is stopped, the devices that have their backup CM as "standby" should fail over immediately.
However if they lose connection to the active CM -- for example due to network connectivity problems or server losing power -- then if nobody is trying to press any buttons on the phone it should take between about 61 and 90 seconds, which is the amount of time it would take to miss 3 keepalives with a default keepalive timer of 30 seconds.
That process would be sped up if the user tries to press buttons or pick up the phone and the phone is unable to contact the active CM that has become unavailable.
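For what it's worth, the 61 to 90 second figure can be worked out with a few lines of Python. This is only a rough sketch of the arithmetic, assuming the phone sends a keepalive every 30 seconds and fails over at the moment the third consecutive keepalive goes unanswered. The function names and that simplified model are my own for illustration and are not taken from the phone firmware.

```python
def failover_window(keepalive_interval_s=30.0, missed_keepalives=3):
    """Return (lower bound, upper bound) in seconds for idle-phone failover.

    Assumed model: the phone gives up on the active CM when the Nth
    consecutive keepalive goes unanswered.
    """
    if keepalive_interval_s <= 0 or missed_keepalives < 1:
        raise ValueError("interval must be positive and missed_keepalives >= 1")
    lower = keepalive_interval_s * (missed_keepalives - 1)
    upper = keepalive_interval_s * missed_keepalives
    return lower, upper


def detection_delay(seconds_since_last_ack, keepalive_interval_s=30.0,
                    missed_keepalives=3):
    """Seconds from CM failure until the phone fails over, given how long
    before the failure the last keepalive was acknowledged."""
    if not 0 <= seconds_since_last_ack < keepalive_interval_s:
        raise ValueError("offset must be within one keepalive interval")
    return keepalive_interval_s * missed_keepalives - seconds_since_last_ack


print(failover_window())      # bounds with the default 30 s timer
print(detection_delay(0))     # CM died right after an acknowledged keepalive
print(detection_delay(29))    # CM died just before the next keepalive was due
```

With the defaults, a CM that dies right after acknowledging a keepalive gives the worst case of 90 seconds, and one that dies just before the next keepalive is due gives roughly 61 seconds.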
Hi Dave, we just built a two server cluster with 3.1(2a) loaded and noticed a very slow failover. The phone has the primary callmanager listed as Active and the secondary callmanager listed as Standby. Here's what we found when disconnecting the Active callmanager from the network:
- If the phone was left alone, it would take just over one minute and it would register with the Standby callmanager.
- If we took the phone off hook, we would not get dial tone for about one minute.
So we did not see the immediate failover when we tried to use the phone after disconnecting the Active server. We would take the phone on and off hook repeatedly with no failover until one minute after the Active callmanager failed. However, the return to the primary callmanager was immediate when we reconnected the primary callmanager to the network. I plan to open a TAC case tomorrow.
Thomas, the failover would not be immediate when trying to use the phone after disconnecting the server listed as Active. However, it should certainly be less than a minute if I am remembering correctly from experience, definitely if another server is in Standby on the phone.
Just a sanity check, if you stop the CallManager service on the Active CM, the failover is almost immediate, correct?
We are in the process of rebuilding the servers due to a problem that occurred when upgrading from 3.0(8) to 3.1(2a). Per TAC, the CallManager service parameter VoiceMail changes to a True/False field if you upgrade directly to 3.1(2a) without installing 3.1(1) first, which causes the Messages button to stop working. So, I will not be able to test until tomorrow. I did shut down the primary CallManager (the Subscriber for us) and monitored the failover. I am pretty sure it also took a minute before we could use the phone. After the rebuild, once we are back on 3.1(2a) without problems, we will retest as you requested.
Here's what I get on a 3.1(2a) system with one CM as Active and one CM as Standby:
- Stop the active CM and the failover is immediate.
- Disconnect the active CM from the network and the failover takes 60 seconds.
After disconnecting the Active CM from the network, I would attempt to get dial tone repeatedly until dial tone was received. This reliably took between 55 and 60 seconds.
Stop the active CM - good, as I expected.
Disconnect active CM - right.
Disconnect active CM then try to get dial tone repeatedly and no change - this I am not too happy about.
Can you plug a sniffer into the back of a phone during this test (before you unplug the active CM) and capture this occurring? I think you should open a case with TAC and get this to them so we can find out from development if this is normal or not.
It looks like the phone continues to send TCP ACKs to the active CM and never times out. I can send you a copy of the sniffer capture file.
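In case it helps anyone reading a similar capture, here is a rough Python sketch for measuring how long a phone kept sending to a CM that had stopped answering. It assumes the trace has already been exported to simple (timestamp, source IP, destination IP) records. The addresses and the sample packets below are made up for illustration and are not from the actual capture file.

```python
from typing import Iterable, Optional, Tuple

# (timestamp in seconds, source IP, destination IP)
Packet = Tuple[float, str, str]

# Hypothetical addresses from the documentation range, not the real network.
PHONE = "192.0.2.10"
ACTIVE_CM = "192.0.2.1"
STANDBY_CM = "192.0.2.2"


def unanswered_span(packets: Iterable[Packet], phone: str, cm: str) -> Optional[float]:
    """Seconds between the CM's last packet to the phone and the phone's
    last packet to that CM. Returns None if either direction is missing."""
    last_from_cm = None
    last_to_cm = None
    for ts, src, dst in packets:
        if src == cm and dst == phone:
            last_from_cm = ts
        elif src == phone and dst == cm:
            last_to_cm = ts
    if last_from_cm is None or last_to_cm is None:
        return None
    return max(0.0, last_to_cm - last_from_cm)


# Made-up sample: the CM answers once, then goes silent while the phone
# keeps sending until it finally contacts the standby CM.
sample = [
    (0.0, PHONE, ACTIVE_CM),
    (1.0, ACTIVE_CM, PHONE),
    (5.0, PHONE, ACTIVE_CM),
    (20.0, PHONE, ACTIVE_CM),
    (58.0, PHONE, ACTIVE_CM),
    (59.0, PHONE, STANDBY_CM),
]

print(unanswered_span(sample, PHONE, ACTIVE_CM))
```

A span close to a full minute in a real trace would line up with the 55 to 60 second delay reported above, but the capture itself is what TAC would need to see.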