IP phones fail over to SRST router even when CUCM is still reachable

Unanswered Question
Aug 10th, 2009

Guys, I have the following topology:


CUCM---[switch1]-------[switch2]-----IP phones

|

|

SRST router


Both switches carry data VLANs and voice VLANs.

Some of the phones fail over to the SRST router at random times while the others stay registered to the CUCM.

There is a trunk port between both switches.

I checked the traces on the CUCM and found a "Socket Broken" message at the moment the IP phone unregisters from the CUCM.

I plugged an IP phone directly into switch1 to check whether the same problem persists, and the tested phone works fine.

I also checked for input/output errors on the interfaces of both switches, but everything looks clean.

Did anyone face a similar problem before?

Please advise

Regards,

Moustapha


Wilson Samuel Mon, 08/10/2009 - 10:37

Hi Moustapha,


It seems more like a connectivity issue at your switch level than anything else.


Could you please verify that connectivity between the switches themselves, and between the switches and the phones, is clean?


You may want to clear the interface counters, let them run for at least 24 hours, then do a show interface and check whether the error counters are increasing.
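
As a rough sketch of that "are the errors increasing?" check: the snippet below compares two counter snapshots taken some time apart and reports only the counters that grew. The interface names, counter names, and numbers are hypothetical placeholders, not readings from this network, and the snapshots are assumed to have been copied by hand from the switch output.

```python
def increasing_counters(before, after):
    """Return {interface: {counter: delta}} for every counter that grew."""
    grown = {}
    for iface, counters in after.items():
        baseline = before.get(iface, {})
        deltas = {
            name: value - baseline.get(name, 0)
            for name, value in counters.items()
            if value > baseline.get(name, 0)
        }
        if deltas:
            grown[iface] = deltas
    return grown


# Hypothetical snapshots: one right after clearing the counters, one a day later.
before = {
    "Gi0/1": {"input_errors": 0, "crc": 0, "output_errors": 0},
    "Gi0/2": {"input_errors": 0, "crc": 0, "output_errors": 0},
}
after = {
    "Gi0/1": {"input_errors": 0, "crc": 0, "output_errors": 0},
    "Gi0/2": {"input_errors": 12, "crc": 12, "output_errors": 0},
}

print(increasing_counters(before, after))
```

An empty result means no counter moved during the window, which is what the original poster reports seeing.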


Please rate if it helps


Regards

Wilson Samuel

Mustafa Al Housami Mon, 08/10/2009 - 21:06

Well,

I already checked that, and no errors are displayed on the interfaces.

The CUCM and SRST router configs are correct, and I am still investigating this issue.

If anyone has any other suggestions, please help

Thanks

vipersl65 Tue, 08/11/2009 - 00:04

Isolate the problem further.


1) Phone firmware load version: is it the same for all phones of the same model?


2) Switch OS version: is it the same on both switches?

3) Locate a phone that fails over and note the switch port it is connected to. Swap it with a phone that has NEVER failed over, note the switch and port that one was connected to, and observe.
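
For step 1, a minimal sketch of the comparison, assuming the phone inventory has already been exported into a list of records. The phone names, model labels, and load names below are made-up placeholders, not real firmware identifiers.

```python
from collections import defaultdict


def mixed_firmware(phones):
    """Return {model: [loads]} for every model running more than one load."""
    loads_by_model = defaultdict(set)
    for phone in phones:
        loads_by_model[phone["model"]].add(phone["load"])
    return {
        model: sorted(loads)
        for model, loads in loads_by_model.items()
        if len(loads) > 1
    }


# Hypothetical inventory; replace with the real export.
phones = [
    {"name": "phone-1", "model": "model-A", "load": "load-1"},
    {"name": "phone-2", "model": "model-A", "load": "load-2"},
    {"name": "phone-3", "model": "model-B", "load": "load-7"},
]

print(mixed_firmware(phones))
```

Any model that shows up in the result is worth cross-checking against the list of phones that failed over.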



koziollz1 Thu, 05/19/2011 - 08:07

I am also experiencing this issue. Have you been able to resolve it, OP? Do you recall what it was?


Can anyone else provide any input? Has anyone else run into this and successfully resolved it?


I pulled CUCM traces and see the following for multiple phones during the reported SRST timeframe:


05/16/2011 10:05:11.183 CCM|StationInit: TCPPid = [1.100.7.1599823]Socket Broken. DeviceName=,IPAddr=10.11.3.120, Port=0x33ff, Device Controller=[0,0,0]|

05/16/2011 10:07:06.117 CCM|StationInit: TCPPid=[ 1.100.7.1590130] Keep alive timeout.|
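
When there are many phones to go through, a small parser can pull the disconnect events out of the trace files. This is only a sketch built around the two line shapes quoted above; it assumes the timestamp is month/day/year and that other trace versions may format these lines differently. The names `parse_line`, `EVENT_RE`, `IP_RE`, and `PORT_RE` are mine, not part of any Cisco tool.

```python
import re
from datetime import datetime

# Matches the two StationInit line shapes quoted above (spacing around
# "TCPPid" differs between them, so whitespace is kept optional).
EVENT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) CCM\|StationInit: "
    r"TCPPid\s*=\s*\[\s*(?P<pid>[\d.]+)\]\s*"
    r"(?P<event>Socket Broken|Keep alive timeout)\."
)
IP_RE = re.compile(r"IPAddr=(?P<ip>\d+\.\d+\.\d+\.\d+)")
PORT_RE = re.compile(r"Port=(?P<port>0x[0-9a-fA-F]+)")


def parse_line(line):
    """Return a dict for a disconnect event, or None for any other line."""
    m = EVENT_RE.match(line)
    if m is None:
        return None
    ip = IP_RE.search(line)
    port = PORT_RE.search(line)
    return {
        "time": datetime.strptime(m.group("ts"), "%m/%d/%Y %H:%M:%S.%f"),
        "tcp_pid": m.group("pid"),
        "event": m.group("event"),
        "ip": ip.group("ip") if ip else None,
        # The trace prints the port in hex; convert it to decimal.
        "port": int(port.group("port"), 16) if port else None,
    }


lines = [
    "05/16/2011 10:05:11.183 CCM|StationInit: TCPPid = [1.100.7.1599823]Socket Broken. DeviceName=,IPAddr=10.11.3.120, Port=0x33ff, Device Controller=[0,0,0]|",
    "05/16/2011 10:07:06.117 CCM|StationInit: TCPPid=[ 1.100.7.1590130] Keep alive timeout.|",
]
events = [e for e in map(parse_line, lines) if e is not None]
for e in events:
    print(e["time"], e["event"], e["ip"], e["port"])
```

Grouping the resulting events by IP address and time should make it easier to see whether the affected phones share a switch, a subnet, or a time window.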


My topology is simplified compared to OP's:


                       SRST router

                                |

                                |

CUCM1&2---[core switch]-------rest of network, switches, IP phones

                                |

                                |

                       IP Phones


Unfortunately I did not catch the show log from the router before it rolled out of the buffer.


Any help would be greatly appreciated!
clileikis Thu, 05/19/2011 - 08:35

Sounds like a network problem somewhere between the IP phones and call manager. How often is this happening? Any network issues?

koziollz1 Thu, 05/19/2011 - 08:50

As far as I know, this is the first time it has ever happened.


The system has been running stable for years. The router was rebooted back in December and has been running stable as well.


None of the remote sites experienced any issues at this time, so we know the connectivity between GW<----->Switch<----->CMs is well and stable.


Pinging the CM from router is good:



#ping 10.11.1.131

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.11.1.131, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms


#ping 10.11.1.132

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 10.11.1.132, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms


#trace 10.11.1.131

Type escape sequence to abort.

Tracing the route to 10.11.1.131

  1 10.11.1.131 0 msec 0 msec 4 msec


#trace 10.11.1.132

Type escape sequence to abort.

Tracing the route to 10.11.1.132

  1 10.11.1.132 0 msec 0 msec 0 msec



Pinging the router from CM is good:


admin:utils network ping 10.11.1.130

PING 10.11.1.130 (10.11.1.130) 56(84) bytes of data.

64 bytes from 10.11.1.130: icmp_seq=0 ttl=255 time=1.05 ms

64 bytes from 10.11.1.130: icmp_seq=1 ttl=255 time=0.625 ms

64 bytes from 10.11.1.130: icmp_seq=2 ttl=255 time=0.681 ms

64 bytes from 10.11.1.130: icmp_seq=3 ttl=255 time=0.641 ms


--- 10.11.1.130 ping statistics ---

4 packets transmitted, 4 received, 0% packet loss, time 3098ms

rtt min/avg/max/mdev = 0.625/0.750/1.055/0.179 ms, pipe 2



admin:utils network traceroute 10.11.1.130

1  10.11.1.130 (10.11.1.130)  0.646 ms *  0.540 ms



admin:utils network ping 10.11.1.130

PING 10.11.1.130 (10.11.1.130) 56(84) bytes of data.

64 bytes from 10.11.1.130: icmp_seq=0 ttl=255 time=0.527 ms

64 bytes from 10.11.1.130: icmp_seq=1 ttl=255 time=0.575 ms

64 bytes from 10.11.1.130: icmp_seq=2 ttl=255 time=0.517 ms

64 bytes from 10.11.1.130: icmp_seq=3 ttl=255 time=0.632 ms


--- 10.11.1.130 ping statistics ---

4 packets transmitted, 4 received, 0% packet loss, time 3013ms

rtt min/avg/max/mdev = 0.517/0.562/0.632/0.054 ms, pipe 2



admin:utils network traceroute 10.11.1.130

1  10.11.1.130 (10.11.1.130)  0.695 ms *  0.654 ms

koziollz1 Thu, 05/19/2011 - 09:00

The SDL traces show the following for the same IP address (just using one of the phones):


049538145

2011/05/16 10:05:11.183

001

AlarmErr


AlarmClass: CallManager

AlarmName: DeviceTransientConnection

AlarmSeverity: Error AlarmMessage:

AlarmDescription: Transient connection attempt.

AlarmParameters:  ConnectingPort:13311

DeviceName:

IPAddress:10.11.3.120

Protocol:SCCP

DeviceType:255

Reason:6

AppID:Cisco CallManager

ClusterID:CALLMGR1-Cluster

NodeID:cm1
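
One way to confirm that this alarm and the "Socket Broken" trace line quoted earlier in the thread describe the same connection: the trace prints the port in hex (Port=0x33ff) while the alarm prints it in decimal (ConnectingPort:13311). A short sketch of that cross-check, with the alarm fields copied from above; `parse_alarm` is just an illustrative helper name.

```python
# Alarm parameters copied from the SDL trace above.
ALARM_TEXT = """\
ConnectingPort:13311
DeviceName:
IPAddress:10.11.3.120
Protocol:SCCP
DeviceType:255
Reason:6
"""


def parse_alarm(text):
    """Split 'Key:Value' lines into a dict; empty values are kept as ''."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


alarm = parse_alarm(ALARM_TEXT)

# Values from the CCM "Socket Broken" trace line quoted earlier in the thread.
trace_ip = "10.11.3.120"
trace_port = int("0x33ff", 16)

same_connection = (
    alarm["IPAddress"] == trace_ip
    and int(alarm["ConnectingPort"]) == trace_port
)
print(same_connection)  # prints True
```

Both entries also carry the same timestamp (10:05:11.183), so they appear to be two views of a single event rather than two separate failures.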

jaylena123 Thu, 07/14/2011 - 16:49

We are having this same "socket broken" issue. Did you ever find a resolution?

Tapan Dutt Thu, 07/14/2011 - 17:57
User Badges:
  • Cisco Employee,

Hi


We need to isolate the issue first...


Is the culprit the phone, the switch, or the Call Manager?


The concept is simple: as soon as the phone loses connectivity and does not receive a keepalive within 90 seconds, it fails over either to the secondary Call Manager server or to SRST mode.
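
A toy model of that decision, purely to make the logic concrete. It takes the 90-second figure from the post as given (not verified here), and it assumes the phone prefers a reachable secondary Call Manager over SRST. The function and constant names are hypothetical; this is not how the phone firmware is actually written.

```python
KEEPALIVE_WINDOW_S = 90  # figure quoted in the post above; treated as an assumption


def registration_target(last_keepalive_s, now_s, secondary_reachable,
                        window_s=KEEPALIVE_WINDOW_S):
    """Decide where the phone should be registered at time now_s.

    last_keepalive_s: time (seconds) of the last keepalive response received.
    secondary_reachable: whether a secondary Call Manager answers.
    """
    if now_s - last_keepalive_s < window_s:
        return "primary"  # keepalives still arriving in time
    # Assumption: a reachable secondary Call Manager is tried before SRST.
    return "secondary" if secondary_reachable else "srst"


print(registration_target(0, 30, secondary_reachable=True))
print(registration_target(0, 90, secondary_reachable=True))
print(registration_target(0, 120, secondary_reachable=False))
```

The point for troubleshooting is that a phone only needs to miss keepalives for one window to move, so a short, localized interruption can push a handful of phones to SRST without any long outage showing up in monitoring.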


HTH

Tapan

Tapan Dutt Thu, 07/14/2011 - 17:59
User Badges:
  • Cisco Employee,

Adding to the above: check the Debug Display on the phone's web page (accessible when you click on the IP address of the phone in CCM).

pmvillarama Fri, 07/15/2011 - 00:31

Hi,


I think the cause was already isolated when you transferred the phones to a different switch. Did you find anything from switch 1?

koziollz1 Fri, 07/15/2011 - 07:10

Let me try to respond to the new posts:




The issue only occurred once. The client requested a reason-for-outage, which is what began the investigation. We went through collecting packet captures, traces, logs, the whole nine yards. We did not find any issues internally on the network. The traces/logs pulled from when the issue occurred all point to no connectivity to CallManager, but give no reason why. We are not dropping packets and the response times are better than ideal. The issue has never occurred since.




What I find extremely strange is:


- Only a few random phones went into SRST


- The CUCM and router are both connected via the core switch, so it seems odd that the phones dropped connectivity to CM but not to the router, as they can clearly communicate through said switch


- Other phones were working just fine at the time


- The phones were scattered on different ports and backplanes


- I highly doubt it's hardware failing on specific ports; that is very unlikely to happen all at the same time and then work again after


- We had no network monitoring alerts regarding critical devices losing connectivity (servers, router, switches)


- It was just a few random phones, internal network, that registered via SRST for a few minutes.....makes no sense....and unable to locate an answer.......Cisco TAC was unable to assist with a reason-for-outage as well, only suggestion was wait for it to happen again......which if it does I doubt we'd be able to collect any different or better data


- And just to play devil's advocate: if it smells like a bug, walks like a bug, talks like a bug......



So the issue has never occurred again, but we are still in the dark as to why it did happen.....sorry I don't have a better answer for all
