ICM router process restarting intermittently

Shalid Kurunnan Chalil · ‎08-03-2013

Hi Guys,

My current setup:

UCCE 8.5.4

Side A:

Rogger A, PGA (AgentPG), VRU PGA, AW/HDS A, pub, sub1, sub2, CVPA, CVPB

Side B:

Rogger B, PGB (AgentPG), VRU PGB, AW/HDS B, sub3, sub4, CVPC, CVPD

I am getting the following errors on Rogger B:

1. Connectivity with duplexed partner has been lost due to a failure of the private network, or duplexed partner is out of service.

2. MDS is out of service.

3. MDS has reported failure to the router that it is out of service.

4. Message Delivery Service (MDS) feed from the Router to the Logger has failed.

5. Central Controller service is unavailable.

6. Requesting MDS termination due to error.

7. Application Gateway has been taken out of service. Application Gateway ID - 5000

8. Client rtr stopping due to error.

9. Client hlgr stopping due to error.

10. Client clgr stopping due to error.

11. Synchronizer is unable to establish connection to peer.

12. Process rtr on ICM\suly\RouterB has detected a failure. Node Manager is restarting the process.

13. Process rtr on ICM\suly\RouterB is down after running for 366 seconds. It will restart after delaying 1 second for related operations to complete.

14. ICM\suly\LoggerB node process hlgr exited cleanly and requested that it be restarted by the Node Manager.
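To see whether the restarts follow a pattern, the "is down after running for N seconds" messages can be pulled out of an exported event list. The following is only a minimal sketch: it assumes the messages have been saved as plain text in the wording shown in item 13 above, and the function name is made up for illustration.

```python
import re

# Matches messages worded like item 13 in the list above.
DOWN_RE = re.compile(
    r"Process (?P<process>\S+) on (?P<node>\S+) is down after running for "
    r"(?P<uptime>\d+) seconds"
)


def uptimes(lines):
    """Yield (node, process, uptime_seconds) for each 'is down' message."""
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            yield match["node"], match["process"], int(match["uptime"])


if __name__ == "__main__":
    sample = [
        r"Process rtr on ICM\suly\RouterB is down after running for 366 seconds.",
        "MDS is out of service.",
    ]
    for node, process, seconds in uptimes(sample):
        print(node, process, seconds)
```

If the uptimes cluster around one value, that points to something periodic; widely scattered values fit random link or host trouble better.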

However, when I checked with the network team, they confirmed that there is no issue on the network side.

I checked the router, MDS, and ccagent logs. None of them showed anything other than "stopping due to error". What might be causing all these processes on Rogger B to stop?

Could you please advise me on how to troubleshoot this issue?

Regards,

Shalid K.C

david.macias · ‎08-04-2013

Does the rtr service ever go active? If so, have you reviewed the startup log to make sure you're not seeing these same errors there? If you're not seeing any errors most of the time, then it seems you have a network issue, or perhaps a route change that the network is correcting, but not fast enough for the router's liking. My advice is not to play cat and mouse with the network team. Have them put a network sniffer on your segment and wait for the next time this happens. This will be the fastest and easiest way to solve this.

david


Hossain Ahmed Ashfaque · ‎08-04-2013

Kindly check the speed of the network cards.

Ashfaque

Senthil Kumar Sankar · ‎08-05-2013

Hello Shalid,

As David suggested, a network sniffer would really help in this scenario.

Additionally, there are a few best practices that need to be followed for UCCE in a Windows environment. I would suggest you cross-verify those as well.

http://www.cisco.com/en/US/products/sw/custcosw/ps1001/products_tech_note09186a00808160f4.shtml

http://docwiki.cisco.com/wiki/Contact_Center_Networking:_Offload,_Receive_Side_Scaling_and_Chimney

Regards,

Senthil

Shalid Kurunnan Chalil · ‎08-05-2013

Thank you guys..

Let me check as you suggested and keep you posted....

Regards,

Shalid K.C

Shalid Kurunnan Chalil · ‎08-06-2013

Hi Senthil and others,

One more thing I noticed:

The Rogger, Agent PG, and VRU PG of Side B are installed on a single VMware machine.

Today I noticed that all three show similar behavior: their essential processes restart intermittently.

So I suspect there might be an issue with the system itself.

But due to my limited knowledge of VMware, I don't know what to check here.

Could you suggest what checks need to be done to isolate the issue?

Regards,

Shalid K.C

Senthil Kumar Sankar · ‎08-06-2013

I think you have the deployment on UCS servers, with three VMs, one for each component (Rogger, Agent PG, VRU PG).

Do you use B-Series or C-Series servers? I would suggest you check the network recommendations for UCCE on UCS servers:

http://docwiki.cisco.com/wiki/UCS_Network_Configuration_for_UCCE#UCCE_on_UCS_C-Series_Network_Configuration

http://docwiki.cisco.com/wiki/UCS_Network_Configuration_for_UCCE#UCCE_on_UCS_B-Series_Network_Configuration

Do you have Windows 2008 or 2003? Can you check the NIC recommendations given at the link below?

http://docwiki.cisco.com/wiki/Contact_Center_Networking:_Offload,_Receive_Side_Scaling_and_Chimney

Regards,

Senthil

Rate if it helps

Shalid Kurunnan Chalil · ‎08-07-2013

Hi Guys,

I have checked the offload settings; please find them below.

C:\Users\icmadmin>netsh int ip sh offload

Interface 1: Loopback Pseudo-Interface 1

udp transmit checksum supported.

tcp transmit checksum supported.

udp receive checksum supported.

tcp receive checksum supported.

Interface 11: Public

ipv4 transmit checksum supported.

udp transmit checksum supported.

tcp transmit checksum supported.

tcp large send offload supported.

ipv4 receive checksum supported.

udp receive checksum supported.

tcp receive checksum supported.

Interface 13: Private

ipv4 transmit checksum supported.

udp transmit checksum supported.

tcp transmit checksum supported.

tcp large send offload supported.

ipv4 receive checksum supported.

udp receive checksum supported.

tcp receive checksum supported.

C:\Users\icmadmin>netsh int sh offload

C:\Users\icmadmin>netsh int tcp show global

Querying active state...

TCP Global Parameters

----------------------------------------------

Receive-Side Scaling State : enabled

Chimney Offload State : automatic

NetDMA State : enabled

Direct Cache Acess (DCA) : disabled

Receive Window Auto-Tuning Level : normal

Add-On Congestion Control Provider : ctcp

ECN Capability : disabled

RFC 1323 Timestamps : disabled
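To compare output like the above against the values in the referenced document, the settings can be parsed into name/value pairs. This is only a rough sketch: the EXPECTED values below are placeholders that must be taken from the Cisco document for your OS version, not recommendations in themselves, and the function names are made up for illustration.

```python
import subprocess

# PLACEHOLDER values only (assumption): fill in from the Cisco docwiki page.
EXPECTED = {
    "Receive-Side Scaling State": "disabled",
    "Chimney Offload State": "disabled",
}


def parse_netsh_globals(text):
    """Return {setting: value} from `netsh int tcp show global` output."""
    settings = {}
    for line in text.splitlines():
        # Setting lines look like "Name : value"; headings have no " : ".
        name, sep, value = line.partition(" : ")
        if sep and name.strip() and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def mismatches(actual, expected):
    """Return {setting: (actual, expected)} for every value that differs."""
    return {
        name: (actual.get(name), wanted)
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }


if __name__ == "__main__":
    # Same command as shown in the post above; run on the Windows server.
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True, text=True,
    )
    for name, (got, wanted) in mismatches(
        parse_netsh_globals(result.stdout), EXPECTED
    ).items():
        print(f"{name}: is {got!r}, expected {wanted!r}")
```

Running the same check on Side A and Side B makes any difference between the two sides easy to spot.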

I also checked the offload settings for both the private and public NICs, and RX and TX offload are enabled on both.

So I am going to disable these offloads as per the document.

Please let me know if you have any suggestions.

Regards,

Shalid K.C

b717k · ‎08-07-2013

Hi Shalid,

This is happening only because of private link flaps; your network/WAN team should look into this issue.

Set up Wireshark on the private IPs and collect MDS logs, or enable Performance Monitor on the Rogger and PGs.

Make sure that when you set all this up your network is not carrying too much traffic, or it could heavily impact the production environment.

thanks

Shalid Kurunnan Chalil · ‎08-14-2013

Hi Bala,

I haven't performed the steps mentioned above yet. I will be working with the network team to get this done.

@Senthil:

I have done the complete configuration as mentioned in the doc, but no luck. The critical processes on PGB and Rogger B are still getting restarted.

Regards,

Shalid K.C

Hardik Kansara · ‎08-31-2013

Hi Shalid,

The error message itself shows that the issue is with your private network:

Connectivity with duplexed partner has been lost due to a failure of the private network, or duplexed partner is out of service.

You should check your private network connectivity.

To cross-verify, do the following.

Ping from the Side A server to the Side B server on both IP addresses (private and visible).

Check on which IP address you observe packet drops, and whenever you observe packet drops on the private NIC, check the processes at the same time.
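A continuous ping with timestamps makes it easier to line up drops with the process restarts. Here is a minimal sketch, assuming Python 3.7+ is available on a Windows machine on the same segments. The addresses and log file name are placeholders, not values from this thread.

```python
import subprocess
import time
from datetime import datetime

# PLACEHOLDER addresses (assumption): replace with the peer's real
# private and visible IPs.
TARGETS = {"private": "192.0.2.10", "visible": "198.51.100.10"}


def reply_received(ping_output):
    """True if the output of one Windows `ping` contains an echo reply."""
    # Echo replies contain "TTL="; timeouts and "unreachable" lines do not.
    return "TTL=" in ping_output.upper()


def ping_once(address, timeout_ms=1000):
    """Send one echo request with the Windows ping command."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), address],
        capture_output=True, text=True,
    )
    return reply_received(result.stdout)


def monitor(targets, interval_s=1.0, log_path="ping_drops.log"):
    """Log a timestamped line for every missed reply until interrupted."""
    while True:
        for label, address in targets.items():
            if not ping_once(address):
                stamp = datetime.now().isoformat(timespec="seconds")
                with open(log_path, "a") as log:
                    log.write(f"{stamp} {label} {address} no reply\n")
        time.sleep(interval_s)


if __name__ == "__main__":
    monitor(TARGETS)
```

The timestamps in the log can then be compared with the restart times in the event viewer. A one-second ping can miss very short interruptions, so a clean log does not rule out the private network, and a packet capture is still the stronger evidence.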

Thanks & Regards,

Hardik B Kansara

Shalid Kurunnan Chalil · ‎01-02-2014

Thank you all for the reply.

Done all of the above and cross-verified; everything seems to be fine.

Here is what I found, which I believe might be causing the issue:

We noticed that all the switches connected to our UCCE servers in Site B are Fast Ethernet, with a maximum capacity of 100 Mbps, whereas Cisco requires Gigabit (1000 Mbps).

This might be the main reason the outage happens only in Site B; in Site A our switches are 1000 Mbps.

Please refer to the Cisco document below regarding the network requirements for UCS C-Series servers.

Network Requirements for UCS C Series Servers

The below design is the default and recommended for all Unified CCE deployments on UCS C series servers. Exampled in Figures 10 and 11 are two possible network side implementations of the same vSphere Hypervisor vSwitch design. This design calls for using the VMware NIC Teaming (without load balancing) of vmnic interfaces in an Active/Standby configuration through alternate and redundant hardware paths to the network, thereby preventing any single point of failure from affecting the Visible or Private network communications.

The network side implementation does not have to exactly match either of those illustrated below, but it must allow for redundancy and not allow for single points of failure affecting both Visible and Private network communications. There are more possible ways that this could be implemented in a supportable design than can be covered here.

Requirements:

Ethernet interfaces must be Gigabit speed, and connected to Gigabit Ethernet switches. 10/100 Ethernet is not supported.
No single point of failure is allowed for visible and private networks.
Network switch infrastructure cannot use Cisco Stacking technology to combine all switches the UCS C series server is connected to into a single virtual switch.
Network switches must be configured properly for connection to VMware Hypervisor. Please refer to the below for details on ensuring proper switch configuration to prevent STP delay in failover/fallback scenarios.

Reference: http://docwiki.cisco.com/wiki/UCS_Network_Configuration_for_Unified_CCE

So we believe we have to replace the switches for smooth functioning of the contact center. We are waiting for the switches to be replaced before going further...

Regards,

Shalid K.C