Been seeing some of these messages lately - CCM_CALLMANAGER-CALLMANAGER-3-SDLLinkOOS. I know they are caused by network problems, but what timer is being used to generate this error? Also, is there a way to record in the event log when the link comes back in service?
CCM_CALLMANAGER-CALLMANAGER-3-SDLLinkOOS: SDL link to remote application out of
This alarm indicates that the local Cisco CallManager has lost communication with the
remote Cisco CallManager.
The most common causes of lost communication between CallManagers are a CallManager server hang, network problems, or high CPU.
You can use RTMT to monitor this.
CM servers keep a TCP connection to other servers in the cluster. When that
TCP connection is broken due to network connectivity or lack of server
resources, the above error is generated. Therefore, there are no specific timers
in CallManager that control the generation of the above message.
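To illustrate the point about there being no application timer, here is a minimal sketch of generic TCP behavior on a loopback socket. It is not CallManager code, and the function name `detect_peer_close` is made up for this example. It shows that when the remote side closes the connection, the local side sees it as an empty read as soon as the close arrives. A link that drops silently (for example a cable pull) is a different case: that is only noticed through the operating system's own TCP retransmission or keepalive handling.

```python
import socket


def detect_peer_close(timeout=2.0):
    """Return True if the client sees the peer's close as an empty read."""
    # Hypothetical stand-in for the remote node: a listener on loopback.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    # Stand-in for the local node holding a long-lived TCP connection.
    client = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    server_side, _ = listener.accept()

    # Simulate the remote node dropping the connection.
    server_side.close()

    # recv() returns b"" once the peer's close arrives; no app timer involved.
    data = client.recv(1024)

    client.close()
    listener.close()
    return data == b""


if __name__ == "__main__":
    print("peer close detected:", detect_peer_close())
```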
Thanks. We use RTMT to monitor CPU utilization. What could you monitor through RTMT to catch network problems or the TCP connection breaking?
I am seeing the same thing. I have a cluster with 1 PUB and 3 SUBs.
Only two subs are seeing SDL OOS errors.
The ports on the network are clean.
I do have a few hard drives that are 90% fragmented and am wondering if that is contributing to the error.
I do not think it is a network error, as no phones go down, no other network devices are affected, and one sub is 100% fine.
Try registering one phone to a Subscriber that is getting the errors, then register another phone to the Publisher. See if you can call from the phone on the SUB to the phone on the PUB and vice versa. If you cannot, you will need to restart the CCM service on every server in the cluster; if possible, it is better to reboot the servers.
It would be good to check the detailed SDL traces, together with a sniffer capture filtered on TCP port 8002, to find out whether this is a network problem or not. To investigate more deeply, I would check the Perfmon stats.
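As a rough complement to the sniffer capture, here is a small sketch that probes TCP port 8002 on a peer node and records when reachability changes, which also gives a timestamp for when the link comes back. This is an assumption-laden illustration, not a Cisco tool: the hostname `sub2.example.local` and the function names are made up, and a successful probe only shows that a new TCP connection can be opened to that port, not that the existing SDL link between the CallManager services is healthy.

```python
import socket
import time

SDL_PORT = 8002  # TCP port mentioned above for SDL traffic


def probe(host, port=SDL_PORT, timeout=3.0):
    """Return True if a new TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def transitions(samples):
    """Turn (timestamp, reachable) samples into a list of UP/DOWN events.

    Only changes of state are reported; the first sample sets the baseline.
    """
    events = []
    previous = None
    for stamp, reachable in samples:
        if previous is not None and reachable != previous:
            events.append((stamp, "UP" if reachable else "DOWN"))
        previous = reachable
    return events


def watch(host, interval=5.0, rounds=3):
    """Probe a host a fixed number of times and return the state changes."""
    samples = []
    for i in range(rounds):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        samples.append((stamp, probe(host)))
        if i < rounds - 1:
            time.sleep(interval)
    return transitions(samples)


if __name__ == "__main__":
    # Hypothetical subscriber hostname; replace with a real node name.
    for stamp, state in watch("sub2.example.local", interval=5.0, rounds=3):
        print(stamp, "TCP", SDL_PORT, state)
```

Running this from one subscriber against another, alongside the RTMT and Perfmon counters, would at least show whether the outages line up with plain TCP reachability or only with load on the server.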
Regarding the phones not going down: phone registration uses a different mechanism than the SDL link, so one can fail without the other. Please let me know where the IP phones are registered so the problem can be isolated.
Sometimes a highly fragmented disk can cause heavy disk I/O that uses up all the CPU.
I would open a TAC case for further investigation.
Thanks for all the advice. I ran the real-time monitor and looked at the SDL processes.
One node seemed to be way out of line.
I ran defrag and found a couple of servers heavily fragmented, which is what I had suspected.
After the defrag I stopped and started the DBL layer and CCM, but the pub would not stop the DBL, so I rebooted all the servers.
At this point I have not seen the error in 2 days.
I have rebooted the servers in the past and it ran clean for five days. This cluster has been in production for over 18 months.
Throughout all this, the port counters have been 100% clean on all ports.
I do have a TAC case open, and no one is willing to accept anything but a network error.
Like I said, I have one sub with no errors and the network is clean, so I do not think it is the network.
I can call from a phone registered anywhere and all is okay.
Looking at the big picture, with one subscriber clean and no errors anywhere on the network, I suspect a server is having performance issues.
I hope the defrag and the reboots will fix it.
I will keep you posted and thanks for all the good ideas.
Sorry, the phones are split between the 3 subscribers.
The PUB has no phones registered to it.
The SDL errors only happen on two subs.
One sub has never had an SDL OOS error. That was the wild card and why I suspected a server performance issue.
I just had this error message last month, but the cause may be different from yours. We had just put in a new ASA firewall and opened all ports. However, the version seemed to be unstable and kept disconnecting my subscriber, thus causing this error. After rebooting it, the error did not come back.
We have since upgraded the ASA firewall, and so far it has been okay.
I encountered that after an OS upgrade, when the NIC went back to "auto/auto".
-> Check whether the server NIC and the switch have the same speed/duplex settings.