CRS Engine restarts every 2-3 days due to NOT_OK response from Watchdog

Unanswered Question
Aug 31st, 2010

Hi,


we have 2 Cisco Unified IP IVR servers running version 7.0(1)SR03_Build011. Every 2-3 days, the CRS Engine restarts on both servers, at different times, with a "WatchdogThread: received NOT_OK response from process CRS Engine" error.


These servers run independently of each other (i.e. not an HA pair), but over the weekend both servers had a CRS Engine restart at the same time. I've looked at the MIVR and MCVD logs and they confirm this, but they are so detailed that I can't actually see what the cause is. There are a number of errors of different types, and the logs show a lot of exceptions, but it appears to be a lost connection to the Call Manager that triggers the restart, and there is a mention of buffer space.


We have a 3rd server which is not part of the solution that the other 2 servers provide, but it has the same OS and the same CRS application version, is connected to the same Call Manager (version 6.1.3-200), and is on the same subnet as the other 2 servers; this server doesn't restart.


The event log looks like this:-


Event Type: Information
Event Source: Cisco Unified CCX Node Manager
Event Category: Devices
Event ID: 3
Date:  8/30/2010
Time:  7:38:18 PM
User:  N/A
Computer: CBXCCM2IVR01
Description:
The description for Event ID ( 3 ) in Source ( Cisco Unified CCX Node Manager ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: WatchdogThread: received NOT_OK response from process CRS Engine, , , , .
Data:
0000: 06 00 ff 00 00 00 00 00   .......
0008: 00 00 00 00 03 00 01 21   .......!
0010: 10 0d f0 83 72 48 cb 01   ..?rH.
0018: 58 00 00 00 00 05 41 00   X.....A.
0020: 6e 6d 00 43 42 58 43 43   nm.CBXCC
0028: 4d 32 49 56 52 30 31 00   M2IVR01.
0030: 57 61 74 63 68 64 6f 67   Watchdog
0038: 54 68 72 65 61 64 3a 20   Thread:
0040: 72 65 63 65 69 76 65 64   received
0048: 20 4e 4f 54 5f 4f 4b 20    NOT_OK
0050: 72 65 73 70 6f 6e 73 65   response
0058: 20 66 72 6f 6d 20 70 72    from pr
0060: 6f 63 65 73 73 20 43 52   ocess CR
0068: 53 20 45 6e 67 69 6e 65   S Engine
0070: 00 00 00 00 00 00 00 00   ........


I have attached the MIVR log; when the error occurs, the relevant part of it shows the following:-


3362183: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at EDU.oswego.cs.dl.util.concurrent.ClockDaemon$RunLoop.run(ClockDaemon.java:630)
3362184: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at com.cisco.util.ThreadPoolFactory$ThreadImpl.run(ThreadPoolFactory.java:853)
3362185: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION:Caused by: java.net.SocketException: No buffer space available (maximum connections reached?): connect
3362186: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.socketConnect(Native Method)
3362187: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.doConnect(Unknown Source)
3362188: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
3362189: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.connect(Unknown Source)
3362190: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.SocksSocketImpl.connect(Unknown Source)
3362191: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.connect(Unknown Source)
3362192: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.connect(Unknown Source)
3362193: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.<init>(Unknown Source)
3362194: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.<init>(Unknown Source)
3362195: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at com.cisco.rmi.LoopbackClientSocketFactory.createSocket(LoopbackClientSocketFactory.java:73)
3362196: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: ... 12 more
3362197: Aug 30 19:38:16.096 BST %MIVR-SS_TEL-7-UNK:RP[num=40600], conn=[40600:CCM2IPT/(P1-CBXCTI_User_1) GCID=(3,5066916)->INVALID]->DISCONNECTED, event=CallCtlConnDisconnectedEv, cause=Other: 17[17], meta=META_CALL_ENDING[132]
3362198: Aug 30 19:38:16.518 BST %MIVR-SS_TEL-7-UNK:RP[num=40600], conn=[40600:CCM2IPT/(P1-CBXCTI_User_1) GCID=(3,5066917)->INVALID]->DISCONNECTED, event=CallCtlConnDisconnectedEv, cause=Other: 17[17], meta=META_CALL_ENDING[132]
3362199: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-THROWS_KEEP_ALIVE_EXCEPTION:Cluster Manager throws KeepAlive Exception: Exception=com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
3362200: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION:com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
3362201: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.cluster.impl.manager.AbstractClusterManager.restart(AbstractClusterManager.java:599)
3362202: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.cluster.impl.manager.Publisher.notifyOne(Publisher.java:104)
3362203: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.cluster.impl.manager.AbstractClusterManager$1.run(AbstractClusterManager.java:667)
3362204: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.executor.impl.ExecutorStubImpl$RequestImpl.runCommand(ExecutorStubImpl.java:690)
3362205: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.executor.impl.ExecutorStubImpl$RequestImpl.run(ExecutorStubImpl.java:486)
3362206: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.executor.impl.ExecutorStubImpl$RequestImpl.run(ExecutorStubImpl.java:762)
3362207: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at EDU.oswego.cs.dl.util.concurrent.ClockDaemon$RunLoop.run(ClockDaemon.java:630)
3362208: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.util.ThreadPoolFactory$ThreadImpl.run(ThreadPoolFactory.java:853)
3362209: Aug 30 19:38:18.705 BST %MIVR-NODE_MGR-1-NODE_MGR_KEEP_ALIVE_ERROR:Node Manager keep alive ping failed: Exception=com.cisco.wfapi.WFKeepAliveException: KeepAliveException in Manager/Startable ; nested exception is:
com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
3362210: Aug 30 19:38:18.705 BST %MIVR-NODE_MGR-1-EXCEPTION:com.cisco.wfapi.WFKeepAliveException: KeepAliveException in Manager/Startable ; nested exception is:
3362211: Aug 30 19:38:18.705 BST %MIVR-NODE_MGR-1-EXCEPTION: com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
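Reading the trace bottom-up, the sequence in the log appears to be: a local RMI connect fails with `java.net.SocketException: No buffer space available` (on Windows this usually means ephemeral ports or socket handles are exhausted), the Cluster Manager's keepalive to the publisher then fails with `MANAGER_CONNECTION_TO_PUBLISHER_LOST`, and finally the Node Manager's keepalive ping fails, which is what produces the NOT_OK restart. Purely as an illustration of that watchdog pattern (this is not Cisco's actual implementation; `ping`, `restart` and the thresholds are made-up names), a keepalive loop of this shape looks like:

```python
import time

def watchdog(ping, max_failures=3, interval=0.0, restart=None):
    """Illustrative keepalive loop (not the Cisco Node Manager):
    call ping() each interval; after max_failures consecutive
    failures, invoke restart() and reset the failure counter.
    Yields the running restart count so a caller can step it."""
    failures = 0
    restarts = 0
    while True:
        try:
            ok = ping()
        except Exception:
            ok = False                      # a ping that throws counts as a miss
        failures = 0 if ok else failures + 1
        if failures >= max_failures:        # engine deemed NOT_OK
            restarts += 1
            if restart:
                restart()
            failures = 0
        yield restarts
        time.sleep(interval)
```

The point of the sketch is that the watchdog only sees the final symptom (missed pings); the root cause here was several layers further down, in socket allocation.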

Aaron Harrison Tue, 08/31/2010 - 06:30

Hi


If it happened on two servers at the same time, then I'd be looking off-box for problems.


- Check whether your CCMs were stable (use RTMT and check for events at the time)

- Run a 'show spanning-tree active detail | i VLAN|hange' or similar to check for STP topology changes on the VLAN the servers are in. Short outages caused by bad port configs in the VLAN can briefly break communication between the different processes on the same box, which can cause CCX/IPIVR to get upset and fail over; it can be very sensitive to this. It may also be worth doing a 'show log' on the switches to see whether any other significant events happened at the same time.
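If you need to repeat that check across many switches, the topology-change counters can be scraped out of the 'show spanning-tree active detail' output. A rough sketch (the exact output format varies by IOS version, and the sample text in the test is made up for illustration):

```python
import re

def topology_changes(show_output):
    """Extract {vlan: topology_change_count} from the text of
    'show spanning-tree active detail'. Assumes the common IOS
    layout where each VLAN section contains a line like
    'Number of topology changes N last change occurred ...'."""
    results = {}
    vlan = None
    for line in show_output.splitlines():
        m = re.search(r'VLAN(\d+)', line)
        if m:
            vlan = int(m.group(1))          # start of a new VLAN section
        m = re.search(r'Number of topology changes (\d+)', line)
        if m and vlan is not None:
            results[vlan] = int(m.group(1))
    return results
```

A non-zero count that keeps climbing on the servers' VLAN at the times of the restarts would point at the kind of short STP outage described above.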


Regards


Aaron


Please rate helpful posts.

Peter Bishop Fri, 11/19/2010 - 06:25

Hi Everyone,


This has now been resolved.


The restarts were caused by 'Java Heap Errors', which in turn were caused by large VXML pages.


We are using a bespoke application which sends a request to a web server and gets a VXML page back in response.
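To make the failure mode concrete: each large VXML response gets buffered and parsed on the engine's Java heap, so a handful of oversized pages can exhaust it. A hypothetical mitigation on the bespoke application's side (this is not a Cisco feature; `get_bytes`, `fetch_vxml` and the cap are invented names for the sketch) is to reject oversized responses before they are buffered:

```python
def fetch_vxml(get_bytes, url, max_bytes=512 * 1024):
    """Fetch a VXML document via the supplied transport callable,
    rejecting oversized responses before they are held in memory.
    All names and the 512 KB cap are illustrative assumptions."""
    data = get_bytes(url)
    if len(data) > max_bytes:
        raise ValueError(
            f"VXML response of {len(data)} bytes exceeds cap of {max_bytes}")
    return data.decode("utf-8")
```

The real fix, as below, was the Engineering Special; a guard like this would only bound the damage while waiting to apply it.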


This is fixed in Engineering Special ES05, so we had to apply SR05 first and then ES05 on top of it.


We also have 2 other CRS / IVR / UCCX servers that don't use any HTTP / VXML applications; these servers never experienced the problem, so I didn't need to upgrade them.


Hope this helps anyone having similar problems.


Regards,


Peter
