WAAS goes into "CMS status offline"

Aug 23rd, 2010

I have a WAAS (WAVE 574) that goes into CMS status offline from time to time. It recovers after a few minutes, but it is happening several times an hour.

The WAAS is logging:

2010 Aug 23 09:55:40 n25320 java: %WAAS-CMS-4-716058: ce(StatsTransmitter): Unable to contact CM [10.81.33.136] for statistics reporting:unicorn.RpcException: Unmarshaled: 9001
2010 Aug 23 10:01:58 n25320 java: %WAAS-CMS-4-716058: ce(StatsTransmitter): Unable to contact CM [10.81.33.136] for statistics reporting:unicorn.RpcException: Unmarshaled: 9001
2010 Aug 23 10:08:17 n25320 java: %WAAS-CMS-4-716058: ce(StatsTransmitter): Unable to contact CM [10.81.33.136] for statistics reporting:unicorn.RpcException: Unmarshaled: 9001
2010 Aug 23 10:08:26 n25320 java: %WAAS-CMS-4-716058: ce(DataFeedPoll): Cannot get updates from 10.81.33.136
2010 Aug 23 10:22:45 n25320 java: %WAAS-CMS-4-716058: ce(StatsTransmitter): Unable to contact CM [10.81.33.136] for statistics reporting:unicorn.RpcException: Unmarshaled: 9001

I am running WAAS v. 4.1.7.11

The box was running 4.2.1 when delivered and it has been downgraded to 4.1.7.

I have tried a CMS deregister, wr e, reload, and CMS enable. It looked as if it was working again, but the fault returns.

Does anyone have an idea?

Best regards

Per Overgaard

jomerril Mon, 08/23/2010 - 07:46

A WAE will go "Offline" in the CM GUI when it has missed 3 polling cycles.  By default, each cycle is 5 minutes.  So, after 15 minutes of failed connection attempts by the WAE to the CM, the CM will mark the WAE "Offline".  These are TCP connections from the WAE to the CM on port 443.  You will want to look for general connectivity problems between the WAE and CM, and also look in the CM's logs to see whether it was too busy at the time to take the connection requests from the WAE (something that may happen in very large--thousands of devices--WAAS deployments).
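The timing above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not how the CM is actually implemented. The function name, its parameters, and the "last contact" time are hypothetical; the defaults (300-second polling interval, 3 missed cycles) are the ones described above.

```python
from datetime import datetime, timedelta

def is_marked_offline(last_contact, now, poll_interval_s=300, missed_cycles=3):
    """Return True once the WAE has been silent for missed_cycles polling intervals."""
    threshold = timedelta(seconds=poll_interval_s * missed_cycles)
    return now - last_contact >= threshold

# Hypothetical last successful contact from the WAE.
last = datetime(2010, 8, 23, 9, 28, 0)

print(is_marked_offline(last, datetime(2010, 8, 23, 9, 40, 0)))   # 12 minutes of silence -> False
print(is_marked_offline(last, datetime(2010, 8, 23, 9, 43, 52)))  # almost 16 minutes -> True
```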

There is another mechanism that the CM may use for detecting the WAE's online state: "Fast Device Offline Detection".  This mechanism uses small heartbeat packets sent over UDP.  This is great for smaller deployments and lab environments.  However, if there are a large number of WAEs, or if the rate, count, or detection time is tuned too low, that can also lead to a false Offline status.

So, in the case of the standard Offline detection, based on CMS polling cycles, anything that causes the WAE to fail to connect to the CM for 3 consecutive polling cycles (default is 5 minutes between each attempt) would trigger an Offline status in the CM GUI.  This can include network problems, CMS errors on the WAE, or a very busy CM.  Your CMS logs on both the WAE (which you already provided) and the CM may provide clues.

Or, in the case of Fast Device Offline Detection, anything that causes the configured number of consecutive UDP heartbeat packets, sent at the configured interval, to fail would also trigger the Offline status in the CM GUI.  This is typically going to be caused by network problems or a busy network where UDP packets get dropped.

The WAE logs you provided seem to indicate some connectivity problem, or perhaps a CM that is too busy.  Look in the CM's logs for CMS to see if there are any logs pertaining to this WAE, and what they say about the connection attempts.  If there are no logs for this WAE, then look for other causes of connectivity problems such as dropped packets, duplex-mismatches, routing problems, etc.

By the way, Michael Holloway and I are hosting an "Ask the Experts - WAAS Monitoring and Reporting" discussion thread here:

https://supportforums.cisco.com/thread/2036888?tstart=0

This question, and any other similar questions you have would be very appropriate to post to that thread.  We look forward to hearing from you there. 

p.overgaard Mon, 08/23/2010 - 09:49

Hi

The CM is doing nothing. There are only 3 WAAS boxes running at the moment.

I can see the following log entries in the CM's log, covering the same timeframe as the earlier log from the remote WAAS:

2010 Aug 23 09:43:52 N25300 java: %WAAS-CMS-5-700001: cdm(DeviceStatusMonitor): DeviceStatusMonitor changing 1 devices to offline status.
2010 Aug 23 09:43:52 N25300 java: %WAAS-CMS-5-700001: cdm(DeviceStatusMonitor): Device n25320 with id CeConfig_4706 came offline
2010 Aug 23 09:47:18 N25300 java: %WAAS-CMS-5-700001: cdm(RpcWorker-1): Device n25320 with id CeConfig_4706 came online
.

** deleted log entries that I don't consider relevant **

.

2010 Aug 23 10:03:45 N25300 java: %WAAS-CMS-5-700001: cdm(DeviceStatusMonitor): DeviceStatusMonitor changing 1 devices to offline status.
2010 Aug 23 10:03:45 N25300 java: %WAAS-CMS-5-700001: cdm(DeviceStatusMonitor): Device n25320 with id CeConfig_4706 came offline
2010 Aug 23 10:19:18 N25300 java: %WAAS-CMS-5-700001: cdm(RpcWorker-1): Device n25320 with id CeConfig_4706 came online
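To see how long each offline spell lasted, the "came offline" / "came online" pairs can be matched up with a short script. This is a rough sketch, assuming the log format shown above with wrapped lines joined into one; the function name and regular expression are my own and not part of any WAAS tooling.

```python
import re
from datetime import datetime

# Event lines copied from the CM log above (wrapped lines joined).
CM_LOG = """\
2010 Aug 23 09:43:52 N25300 java: %WAAS-CMS-5-700001: cdm(DeviceStatusMonitor): Device n25320 with id CeConfig_4706 came offline
2010 Aug 23 09:47:18 N25300 java: %WAAS-CMS-5-700001: cdm(RpcWorker-1): Device n25320 with id CeConfig_4706 came online
2010 Aug 23 10:03:45 N25300 java: %WAAS-CMS-5-700001: cdm(DeviceStatusMonitor): Device n25320 with id CeConfig_4706 came offline
2010 Aug 23 10:19:18 N25300 java: %WAAS-CMS-5-700001: cdm(RpcWorker-1): Device n25320 with id CeConfig_4706 came online
"""

# Timestamp, device name, and new state of a status-change line.
EVENT = re.compile(
    r"^(\d{4} \w{3} +\d+ \d{2}:\d{2}:\d{2}) .*Device (\S+) with id \S+ came (offline|online)"
)

def offline_periods(log_text):
    """Yield (device, went_offline, came_online, duration) for each offline spell."""
    went_offline = {}
    for line in log_text.splitlines():
        m = EVENT.match(line)
        if not m:
            continue  # not a status-change line
        stamp = datetime.strptime(m.group(1), "%Y %b %d %H:%M:%S")
        device, state = m.group(2), m.group(3)
        if state == "offline":
            went_offline[device] = stamp
        elif device in went_offline:
            start = went_offline.pop(device)
            yield device, start, stamp, stamp - start

for device, start, end, duration in offline_periods(CM_LOG):
    print(device, start.time(), "->", end.time(), "offline for", duration)
```

For the entries above this gives spells of 0:03:26 and 0:15:33.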

There is an RTT of approx. 120 ms between the CM and the remote WAAS. I have not seen any routing issues, and OSPF has not dropped between the 2 sites.

Fast WAE offline detection is not enabled.

I have enabled TCP keepalives. Could this be the reason?

If the CM loses connectivity to the remote WAAS, does the WAAS still optimize the traffic? (I am presuming it does.)

Thanks

Per

jomerril Mon, 08/23/2010 - 10:09

To answer your last question first, yes, regardless of the WAE's status in the CM GUI, if the WAE is operational it will continue to optimize traffic.  However the WAE is configured and operating, it will continue to do so even if the CM thinks the WAE has gone offline.

The TCP keepalives shouldn't have anything to do with the online or offline status.  The TCP connections used for polling the CM are not long-lived connections.  It is a separate/new TCP connection every time the WAE polls the CM.  And, of course if you are using fast offline detection (mentioned earlier) then those are UDP packets and have nothing to do with TCP or TCP keepalives.

Have you determined whether you have fast offline detection enabled?  You can see this in the CM GUI by going to the main (My WAN) page, expanding the Configure menu on the left, and clicking "Fast Device Offline Detection".  If it is configured, you might try increasing the Heartbeat Fail Count and decreasing the Heartbeat Rate since you have such a small deployment.

Regardless of the offline detection method used, you may take packet captures from the WAE and from the CM, filtered on the WAE's IP address.  Don't worry about capturing the full packet size, since it will be encrypted traffic anyway.  Just see if the packets sent from the WAE match the packets received by the CM, that there aren't dropped/lost packets, retransmissions, etc.  Make sure that the WAE really is checking in with the CM regularly.

The interval at which the WAE contacts the CM for CMS-based offline detection is defined by the System.datafeed.pollrate property on the My WAN->SystemProperties page.  By default it should be 300 seconds (5 minutes).

If this information isn't sufficient to help you identify and resolve the cause of the WAE periodically going Offline, it may be time to collect a detailed problem description, with specific timings and events, sysreports from WAE and CM, and the packet captures mentioned above, and bring them to Cisco TAC.

p.overgaard Mon, 08/23/2010 - 13:33

Fast WAE offline detection is not enabled, so that should not be what is causing the problem.

I have seen another strange behavior when telnetting to the device directly from a remote location. I have no problem connecting to the device from the WCCP neighbor routers. I resolved that by clearing the ARP table. The subnet the WAAS is located on is running GLBP, and I am using "WCCP negotiated return" for the return traffic from the WAAS. The location has 2 WAN links to 2 WAN routers that are doing the WCCP redirection.

If the problem persists I will open a TAC case and I will post the result here.

Thanks again

Per

p.overgaard Tue, 08/24/2010 - 00:50

I think I have found the cause. Changing GLBP to HSRP solved the problem. I am quite sure that GLBP on the WAN routers was causing the problem. The routers are new C3945s running IOS 15.0(1)M3.

Per

jomerril Tue, 08/24/2010 - 05:55

Excellent.  Thank you for letting us know.

And, don't forget...  If you have any WAAS monitoring or reporting related questions, please post to our Ask The Expert thread, here:

https://supportforums.cisco.com/thread/2036888

We'll be answering questions on that thread through the end of this week.

Karthikeyan Ayy... Mon, 12/05/2011 - 01:27

Hello ,

I have a WAVE 574 configured in inline mode with trunking enabled, but the status shows offline. Can anyone help explain the reason?

interface InlineGroup 1/0

ip address 172.16.0.30 255.255.255.0

encapsulation dot1Q 2

inline vlan all

exit

interface InlineGroup 1/1

encapsulation dot1Q 2

inline vlan all

exit

wave-574#sh cms info
Device registration information :
Device Id                            = 1041
Device registered as                 = WAAS Application Engine
Current WAAS Central Manager         = x.x.x.x

Registered with WAAS Central Manager = x.x.x.x
Status                               = Offline
Time of last config-sync             = Sun Dec  4 05:59:17 2011

CMS services information :
Service cms_ce is running
sing-b1mr13-wave#

jomerril Mon, 12/05/2011 - 07:20

There isn’t really enough in your message to point to the root cause, but I can suggest some things to look at.

First, let’s understand what “offline” means. After a WAE registers with the CM, it is configured to check back with the CM every 5 minutes for configuration changes. This timer is configurable, but 5 minutes (300 seconds) is the default. The CM is expecting the WAE to check, and if the WAE fails to contact the CM after 3 consecutive polling cycles (15 minutes based on the default timer of 5 minutes polling interval), then the CM marks the WAE as “Offline”. This communication occurs over port 443.

In smaller deployments and lab environments, you can also configure a Fast Offline Detection mechanism, which uses UDP and is quicker to report an offline device. But this is not usually configured.

Almost always, a device going Offline is a simple communication problem where routing, configuration, firewall, cabling, or something else is getting in the way and preventing the CM and WAE from communicating over port 443. In other words, something is preventing the WAE from successfully contacting the CM over port 443 for 15 consecutive minutes. In other, less common cases, there may be a problem with the CMS or the registration to CMS.
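A quick way to test whether TCP port 443 on the CM is reachable from the WAE's subnet is a plain connect test. The sketch below is a generic Python check meant to be run from a host on the same VLAN as the WAE, not on the WAE itself. The function name is hypothetical, and it only confirms that a TCP handshake completes; it says nothing about whether CMS itself is healthy.

```python
import socket

def can_reach(host, port=443, timeout_s=5.0):
    """Return True if a TCP connection to host:port completes within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts/networks.
        return False
```

Call it with your CM's address, for example can_reach("x.x.x.x"). If it returns False from the WAE's subnet but True from elsewhere, look at routing, VLAN, or firewall rules between the two.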

Some things to look at…

Verify the ‘ip default-gateway’ is configured correctly. I’ve had cases in the past where the administrator configured the default-gateway with the same IP address as the WAE’s own interface IP address. After sync with CM, that CLI fails and the default-gateway configuration gets automatically removed from the WAE.

Verify the WAE can ping the CM’s IP address, and also that the CM can ping the WAE’s IP address. If not, track down why. Perhaps a routing, cabling or vlan configuration problem?

Look in the CMS errorlog file (in /local1/errorlog), and the syslog for error messages that correspond to the “offline” event.

Look at your ‘show alarm history detail’ for an alarm that might provide more information.
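The default-gateway mistake mentioned above is easy to check for mechanically. Here is a small sketch using Python's standard ipaddress module. The function name is hypothetical, the "not on the subnet" test is a general IP sanity check I have added rather than something WAAS reports, and 172.16.0.1 is only an assumed gateway address for illustration.

```python
import ipaddress

def gateway_problem(if_addr, netmask, gateway):
    """Return a description of the problem, or None if the gateway looks sane."""
    iface = ipaddress.ip_interface(f"{if_addr}/{netmask}")
    gw = ipaddress.ip_address(gateway)
    if gw == iface.ip:
        return "default gateway is the WAE's own interface address"
    if gw not in iface.network:
        return "default gateway is not on the interface's subnet"
    return None

# Interface address taken from the InlineGroup 1/0 configuration posted above.
print(gateway_problem("172.16.0.30", "255.255.255.0", "172.16.0.30"))  # the mistake described above
print(gateway_problem("172.16.0.30", "255.255.255.0", "172.16.0.1"))   # assumed gateway; prints None
```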
