CRS Engine restarts every 2-3 days due to NOT_OK response from Watchdog

Hi,

We have two Cisco Unified IP IVR servers running version 7.0(1)SR03_Build011. Every 2-3 days, the CRS engine restarts at different times on both servers, and the event log reports "WatchdogThread: received NOT_OK response from process CRS Engine".

These servers run independently of each other (i.e. they are not an HA pair), but over the weekend both had a CRS engine restart at the same time. The MIVR and MCVD logs confirm this, but they are so detailed that I still can't see what the cause is. The logs contain a number of errors of different types and a lot of exceptions, but the restart seems to be triggered by the engine losing its connection to the Call Manager, and the log also mentions buffer space.

We have a third server which is not part of the solution that the other two provide. It has the same OS and the same CRS application version, is connected to the same Call Manager (version 6.1.3-200), and is on the same subnet as the other two servers, but it doesn't restart.

The event log looks like this:

Event Type: Information
Event Source: Cisco Unified CCX Node Manager
Event Category: Devices
Event ID: 3
Date:  8/30/2010
Time:  7:38:18 PM
User:  N/A
Computer: CBXCCM2IVR01
Description:
The description for Event ID ( 3 ) in Source ( Cisco Unified CCX Node Manager ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: WatchdogThread: received NOT_OK response from process CRS Engine, , , , .
Data:
0000: 06 00 ff 00 00 00 00 00   .......
0008: 00 00 00 00 03 00 01 21   .......!
0010: 10 0d f0 83 72 48 cb 01   ..?rH.
0018: 58 00 00 00 00 05 41 00   X.....A.
0020: 6e 6d 00 43 42 58 43 43   nm.CBXCC
0028: 4d 32 49 56 52 30 31 00   M2IVR01.
0030: 57 61 74 63 68 64 6f 67   Watchdog
0038: 54 68 72 65 61 64 3a 20   Thread:
0040: 72 65 63 65 69 76 65 64   received
0048: 20 4e 4f 54 5f 4f 4b 20    NOT_OK
0050: 72 65 73 70 6f 6e 73 65   response
0058: 20 66 72 6f 6d 20 70 72    from pr
0060: 6f 63 65 73 73 20 43 52   ocess CR
0068: 53 20 45 6e 67 69 6e 65   S Engine
0070: 00 00 00 00 00 00 00 00   ........

I have attached the MIVR log. When the error occurs, the relevant part of the MIVR log shows the following:

3362183: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at EDU.oswego.cs.dl.util.concurrent.ClockDaemon$RunLoop.run(ClockDaemon.java:630)
3362184: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at com.cisco.util.ThreadPoolFactory$ThreadImpl.run(ThreadPoolFactory.java:853)
3362185: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION:Caused by: java.net.SocketException: No buffer space available (maximum connections reached?): connect
3362186: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.socketConnect(Native Method)
3362187: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.doConnect(Unknown Source)
3362188: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
3362189: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.PlainSocketImpl.connect(Unknown Source)
3362190: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.SocksSocketImpl.connect(Unknown Source)
3362191: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.connect(Unknown Source)
3362192: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.connect(Unknown Source)
3362193: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.<init>(Unknown Source)
3362194: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at java.net.Socket.<init>(Unknown Source)
3362195: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: at com.cisco.rmi.LoopbackClientSocketFactory.createSocket(LoopbackClientSocketFactory.java:73)
3362196: Aug 30 19:38:16.033 BST %MIVR-CLUSTER_MGR-4-EXCEPTION: ... 12 more
3362197: Aug 30 19:38:16.096 BST %MIVR-SS_TEL-7-UNK:RP[num=40600], conn=[40600:CCM2IPT/(P1-CBXCTI_User_1) GCID=(3,5066916)->INVALID]->DISCONNECTED, event=CallCtlConnDisconnectedEv, cause=Other: 17[17], meta=META_CALL_ENDING[132]
3362198: Aug 30 19:38:16.518 BST %MIVR-SS_TEL-7-UNK:RP[num=40600], conn=[40600:CCM2IPT/(P1-CBXCTI_User_1) GCID=(3,5066917)->INVALID]->DISCONNECTED, event=CallCtlConnDisconnectedEv, cause=Other: 17[17], meta=META_CALL_ENDING[132]
3362199: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-THROWS_KEEP_ALIVE_EXCEPTION:Cluster Manager throws KeepAlive Exception: Exception=com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
3362200: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION:com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
3362201: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.cluster.impl.manager.AbstractClusterManager.restart(AbstractClusterManager.java:599)
3362202: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.cluster.impl.manager.Publisher.notifyOne(Publisher.java:104)
3362203: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.cluster.impl.manager.AbstractClusterManager$1.run(AbstractClusterManager.java:667)
3362204: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.executor.impl.ExecutorStubImpl$RequestImpl.runCommand(ExecutorStubImpl.java:690)
3362205: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.executor.impl.ExecutorStubImpl$RequestImpl.run(ExecutorStubImpl.java:486)
3362206: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.executor.impl.ExecutorStubImpl$RequestImpl.run(ExecutorStubImpl.java:762)
3362207: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at EDU.oswego.cs.dl.util.concurrent.ClockDaemon$RunLoop.run(ClockDaemon.java:630)
3362208: Aug 30 19:38:18.705 BST %MIVR-CLUSTER_MGR-2-EXCEPTION: at com.cisco.util.ThreadPoolFactory$ThreadImpl.run(ThreadPoolFactory.java:853)
3362209: Aug 30 19:38:18.705 BST %MIVR-NODE_MGR-1-NODE_MGR_KEEP_ALIVE_ERROR:Node Manager keep alive ping failed: Exception=com.cisco.wfapi.WFKeepAliveException: KeepAliveException in Manager/Startable ; nested exception is:
com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
3362210: Aug 30 19:38:18.705 BST %MIVR-NODE_MGR-1-EXCEPTION:com.cisco.wfapi.WFKeepAliveException: KeepAliveException in Manager/Startable ; nested exception is:
3362211: Aug 30 19:38:18.705 BST %MIVR-NODE_MGR-1-EXCEPTION: com.cisco.wfapi.WFKeepAliveException: MANAGER_CONNECTION_TO_PUBLISHER_LOST
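
With logs this detailed, it can help to strip out the stack frames and low-severity lines so that only the headline messages remain. The following is a minimal Python sketch, not a Cisco tool. It assumes every entry follows the `sequence: timestamp TZ %FACILITY-SUBFACILITY-SEVERITY-MNEMONIC:message` layout seen in the excerpt above, and that a lower severity digit means a more serious message. The names `parse_line` and `headline_entries` and the default threshold of 4 are illustrative choices.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Layout inferred from the excerpt above (an assumption, not a documented spec):
#   <seq>: <Mon DD HH:MM:SS.mmm> <TZ> %<FACILITY>-<SUBFACILITY>-<SEV>-<MNEMONIC>:<message>
LINE_RE = re.compile(
    r"^(?P<seq>\d+): "
    r"(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) (?P<tz>[A-Z]+) "
    r"%(?P<facility>[A-Z]+)-(?P<sub>[A-Z_]+)-(?P<sev>\d)-(?P<mnemonic>[A-Z_]+):"
    r"(?P<msg>.*)$"
)


class Entry(NamedTuple):
    seq: int
    timestamp: str
    subfacility: str
    severity: int
    mnemonic: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None for lines that do not match the layout."""
    m = LINE_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None  # e.g. a wrapped continuation line without a sequence number
    return Entry(
        seq=int(m.group("seq")),
        timestamp=m.group("ts"),
        subfacility=m.group("sub"),
        severity=int(m.group("sev")),
        mnemonic=m.group("mnemonic"),
        message=m.group("msg").strip(),
    )


def headline_entries(lines: Iterable[str], max_severity: int = 4) -> List[Entry]:
    """Keep entries at or below max_severity that are not stack-trace frames."""
    result = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.severity > max_severity:
            continue
        # Stack frames look like "at pkg.Class.method(...)" or "... 12 more".
        if entry.message.startswith(("at ", "...")):
            continue
        result.append(entry)
    return result
```

Running `headline_entries(open(path, errors="replace"))` on a saved log, where `path` is wherever the MIVR log was copied to, should reduce an excerpt like the one above to the `SocketException: No buffer space available` line and the keep-alive failures that follow it.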

2 REPLIES

Re: CRS Engine restarts every 2-3 days due to NOT_OK response fr

Hi

If it happened to two servers at the same time then I'd be looking off box for problems.

- Check whether your CCMs were stable (use RTMT and check for events at the time)

- Run 'show spanning-tree active detail | i VLAN|hange' or similar to check for STP topology changes on the VLAN the servers are in. Short outages caused by bad port configs in the VLAN can briefly upset CCX/IP IVR and make it fail over when communication between different processes on the same box falls over; it can be very sensitive. It may also be worth doing a 'show log' on the switches to see whether any other significant events happened at the same time.
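
To check whether restarts really line up across the two boxes (and with anything found in RTMT or the switch logs), the restart times from each server's event log can be compared. The following is a rough Python sketch. The function name `simultaneous_restarts` and the 60-second window are illustrative assumptions, and the timestamps would need to be collected by hand from each server.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def simultaneous_restarts(
    server_a: Iterable[datetime],
    server_b: Iterable[datetime],
    window: timedelta = timedelta(seconds=60),  # assumed tolerance, adjust to taste
) -> List[Tuple[datetime, datetime]]:
    """Return pairs of restart times (a, b) that fall within `window` of each other."""
    b_sorted = sorted(server_b)
    pairs = []
    for a in sorted(server_a):
        for b in b_sorted:
            if abs(a - b) <= window:
                pairs.append((a, b))
    return pairs
```

Restarts that pair up across independent servers point to something shared (network, Call Manager), while restarts that never pair up point back at each box individually.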

Regards

Aaron

Please rate helpful posts..


Re: CRS Engine restarts every 2-3 days due to NOT_OK response fr

Hi Everyone,

This has now been resolved.

The restarts were caused by Java heap errors, which in turn were caused by large VXML pages.

We are using a bespoke application which sends a URL to a Webserver and gets a VXML page back in response.

This is fixed in Engineering Special ES05, so we had to apply SR05 first and then ES05 on top of it.

We also have two other CRS / IVR / UCCX servers that don't use any HTTP / VXML applications, and these servers never experienced this problem, so I didn't need to upgrade them.

Hope this helps anyone having similar problems.

Regards,

Peter
