SCCP Phone Keepalives

kerry1234 · 06-24-2009

I always understood that a phone would switch to its second-choice CallManager when three keepalives that it sent to its primary CallManager were not acknowledged - with keepalives at the default 30-second interval, that would be a minimum of 60 seconds and a maximum of 90 seconds.
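As a quick sanity check on that arithmetic, here is a minimal Python sketch. It assumes the phone gives up at the moment its third consecutive unanswered keepalive goes out, and that the failure can happen at any point within a keepalive interval. The function name `failover_window` is purely illustrative.

```python
def failover_window(interval_s=30, missed_needed=3):
    """Return (earliest, latest) seconds from the failure until the
    Nth consecutive unanswered keepalive has been sent.

    Assumption: the primary can fail at any point between two keepalives.
    """
    # Failure just before a keepalive: the first unanswered one goes out
    # almost at once, so only (N - 1) further intervals have to elapse.
    earliest = (missed_needed - 1) * interval_s
    # Failure just after a keepalive: a full interval passes before the
    # first unanswered one, so N intervals have to elapse.
    latest = missed_needed * interval_s
    return earliest, latest


print(failover_window())  # (60, 90)
```

With the defaults this reproduces the 60 to 90 second window described above.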

Testing in the lab has shown different behaviour. If I shut down the switch port to which the phone's primary CallManager is connected, the phone re-homes to the secondary CallManager almost immediately after the next keepalive that it sends to its primary goes unanswered.

I've proven this with a sniffer connected to the back of the phone.

Is this a valid test?

K

mchin345 · 06-30-2009

SCCP Device timeouts have a 3x30 second heartbeat to their primary CallManager. The IP Phone device or Unity port will attempt to contact its primary CallManager three times before a failover occurs. This will be a minimum of 1.5 minutes.

The IP phone sends a KeepAlive message every 30 seconds by default. CallManager should answer each KeepAlive the IP phone sends with an acknowledgement message. If CallManager fails to respond to three consecutive KeepAlive messages, the IP phone marks the connection as "bad". The IP phone does not tear down the TCP connection, but it does not attempt to re-register with the "bad" CallManager either. It continues sending KeepAlive messages to the "bad" CallManager until CallManager tears down the TCP connection. This delay gives CallManager time to respond if it recovers quickly. After 10 minutes, the IP phone removes the "bad" tag and again tries to establish communication with CallManager using KeepAlive messages. During this time, the IP phone attempts to establish a connection with its secondary CallManager (or its tertiary if the secondary is not available) and registers if possible.
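The behaviour described above can be sketched as a small state model. This is only an illustration of the description in this post, not Cisco's implementation: the class `PrimaryLink`, the method `on_keepalive`, the helper `simulate` and the state names are all hypothetical, and only the 30-second interval, the three-miss threshold and the 10-minute "bad" tag come from the text.

```python
from dataclasses import dataclass
from typing import Optional

KEEPALIVE_INTERVAL_S = 30    # default interval quoted in the post
MAX_MISSED = 3               # consecutive unanswered KeepAlives before "bad"
BAD_TAG_LIFETIME_S = 600     # "bad" tag is removed after 10 minutes


@dataclass
class PrimaryLink:
    """Hypothetical model of the phone's view of its primary CallManager."""
    missed: int = 0
    bad_since: Optional[float] = None

    def on_keepalive(self, acked: bool, now: float) -> str:
        # Drop the "bad" tag once it has aged out and start counting afresh.
        if self.bad_since is not None and now - self.bad_since >= BAD_TAG_LIFETIME_S:
            self.bad_since = None
            self.missed = 0
        if acked:
            self.missed = 0
            return "bad" if self.bad_since is not None else "ok"
        self.missed += 1
        if self.bad_since is None and self.missed >= MAX_MISSED:
            self.bad_since = now
            return "failover"    # mark primary "bad", try the secondary
        return "bad" if self.bad_since is not None else "waiting"


def simulate(acks):
    """Feed one ack result per KeepAlive interval and collect the states."""
    link = PrimaryLink()
    return [link.on_keepalive(acked, (i + 1) * KEEPALIVE_INTERVAL_S)
            for i, acked in enumerate(acks)]


print(simulate([True, False, False, False, False]))
# ['ok', 'waiting', 'waiting', 'failover', 'bad']
```

In this model the phone only moves to its secondary on the third consecutive unanswered KeepAlive, which is the theory being questioned in this thread.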

kerry1234 · 07-02-2009

Thanks for your reply. That is how I understood the theory to work, but the reality is that when the phone misses the first keepalive response, it attempts to register with its secondary CUCM straight away, so there's no waiting for the second or third keepalives to time out.

I'm not really clear in my own mind what the three-missed-keepalives theory actually achieves.

Kerry

Sushil Kumar Katre · 07-02-2009

Hi Kerry,

I think this has something to do with the TCP session.

Here's what's discussed in the Gateway & Gatekeeper book -

The CallManager process on the active CallManager is manually shut down: This results in the TCP connection between the IP phone and the CallManager being closed. If no backup CallManager is available, the IP phone immediately attempts to register with the SRST gateway.

IP connectivity between the IP phone and the CallManager is broken (this is your case): When the IP phone sends its first keepalive after the TCP connection is broken, it sends TCP retries for 20 to 25 seconds. Then it initiates registration with the SRST gateway.

The CallManager process is locked: In this case, the TCP connection is not closed. The IP phone waits for the keepalives to expire before initiating registration with the SRST gateway. This can take up to 90 seconds. This occurs when the server operating system (OS) is still functioning. If the entire server fails or is shut down, the TCP retries fail, as described in the previous bullet.
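To compare the three cases side by side, here is a rough Python sketch of the worst-case delay implied by the figures quoted above. The function name `worst_case_delay_s` and the scenario labels are made up for illustration. It also assumes a 30-second keepalive interval and treats "immediately" as zero seconds.

```python
def worst_case_delay_s(scenario, keepalive_interval_s=30):
    """Worst-case seconds from the failure until the phone starts to
    register elsewhere, using only the figures quoted from the book."""
    if scenario == "process_shut_down":
        # TCP connection is closed, so the phone reacts immediately.
        return 0
    if scenario == "ip_connectivity_broken":
        # Up to one full interval until the next keepalive is sent,
        # then up to 25 seconds of TCP retries.
        return keepalive_interval_s + 25
    if scenario == "process_locked":
        # TCP stays up; the phone waits for the keepalives to expire.
        return 90
    raise ValueError(f"unknown scenario: {scenario}")


for name in ("process_shut_down", "ip_connectivity_broken", "process_locked"):
    print(name, worst_case_delay_s(name))
```

On these assumptions the broken-connectivity case tops out at about 55 seconds, noticeably less than the 90 seconds of the three-keepalive case.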

So when the connection to CallManager is lost, I don't know how long it takes for the TCP connection to be treated as broken. After that, the phone should send TCP retries for 20-25 seconds, which might not be happening in your case.

I guess you should analyze the TCP flow as well to figure out what's happening.

I too would have expected the phone to take some time to fall back to the other CallManager (not immediately).

Instead of shutting down the port, have you tried pulling the Ethernet cable from the CallManager? (Not sure whether it's going to make any difference, just a thought.)

-> Sushil