WLC heartbeat and failure detection time

Jul 19th, 2012

Hi board,

I have a question regarding the failure detection time for an AP, if the corresponding WLC becomes unavailable.

We're talking about the current 7.2 release here - so no "old" technotes (<5.0) please :-))

I don't completely understand the software configuration guide, the available technotes and some Cisco Live Slides regarding this topic.

Also, the current software configuration guide (7.2) doesn't explain the new fields "AP Retransmit Count" and "AP Retransmit Interval". It only states that these values can be configured, not what they do or what impact they have.

So here's how I understand the functionality:

(Attached image: WLC_heartbeat.JPG)

In this example I used the following values:

  • AP heartbeat timeout: 3 seconds
  • Local mode AP fast heartbeat: 1 second
  • AP primary discovery timeout: 30 seconds (but this is not important in this context, I guess)

The AP probes the WLC every "AP heartbeat timeout" (3 seconds) with an echo request.

First question: How long does the AP wait for a response from the WLC? If I understand it correctly, the value "heartbeat timeout" is more or less something like the "hello interval" (HSRP, EIGRP, OSPF).... but what's the dead time here?

When the WLC doesn't answer the echo request, the AP starts sending fast heartbeat messages (every 1 second, "Local mode AP fast heartbeat").

Second question: What is the dead interval for those fast messages?

After three failed fast heartbeats (at least I think it's three of them .... there's no document I know of stating how many fast heartbeats need to fail), the AP switches to the backup controller.

What I want to achieve is something like a simple convergence time calculation, like everyone knows from routing, STP and so on :-)

- What is the failure detection time? (heartbeat timeout + <x> times fast heartbeat + .... I don't know)

- After the failure detection time, how much time is needed to change to the 2nd controller - performing a join, configuration and run state (assuming same config, same SW)? I know - this is kinda hard to answer.... so I'd already be happy with an answer to the first question ("failure detection time").

Perhaps we can develop this together and create some kind of document here - I think a good document explaining this is needed.

At least I'm not aware of any document explaining this without leaving some open questions. I read some sections of RFC5415 (CAPWAP) as well - but that didn't explain all of it :-)

So I would be really grateful for some input regarding this topic.

Best regards

Joe

saravlak Sat, 07/21/2012 - 15:11

http://www.cisco.com/en/US/tech/tk722/tk809/technologies_configuration_example09186a008064a294.shtml

verify:

You can verify if the configuration works as expected. Power down the primary controller to which the AP is currently registered. The AP waits for the heartbeat time set, which is 30 seconds by default, to detect the failure of the primary WLC. After this period of time, the AP sends heartbeat messages seven more times, one per second, in efforts to find the primary WLC. If the AP does not hear from the primary WLC, the AP registers to an available WLC via the default process. Therefore, the process to detect the primary WLC failure and register to the secondary WLC takes approximately 80 seconds. Once the access point joins the secondary controller, it continues to send the discovery request to the primary controller in order to determine if the primary controller is back in operation. This can be determined with the help of the debug lwapp client packet command.

Note: The heartbeat message is similar to a keepalive message. The AP heartbeat is set to 30 seconds by default. You can adjust this heartbeat time, down to 1 second. However, if you have not made this adjustment since the last time that the AP heard from the WLC, 30 seconds pass before the AP realizes that it cannot reach the WLC.

http://www.cisco.com/en/US/products/ps6366/products_tech_note09186a00809a3f5d.shtml

johannes.luther Mon, 07/23/2012 - 01:07

Thank you for your reply. I also found this topic (https://supportforums.cisco.com/message/3193016#3193016) from 2010 which you quoted in your post.

I doubt it's still valid today with the current 7 release.

Review the Cisco Live Breakout Session "Design and Deployment of Enterprise WLANs" (BRKEWN-2010). It mentions three missed heartbeats instead of seven. I think the mentioned 80 seconds are not correct either in the current releases - even with default timers.

I guess the new configuration parameters "AP retransmit count" and "AP retransmit interval" play a role in this process as well.

So back to start - any ideas, links to documents or best-practices?

I will test this in the lab - but I don't have the equipment yet. I'm still in the conceptual phase and I want to understand the functionality completely :-) Sorry about being so annoying about that.

johannes.luther Tue, 07/24/2012 - 02:28

Just to follow up on this one. Here's a statement about this from the Cisco Live techtorial from 2009:

In 5.0 controller code a new fast heartbeat request & response mechanism was created so that a controller failure can be detected within 4 seconds (1 second timeout + 3 retries)

So three retries in the 5 release.

In the same presentation (chapter AP troubleshooting) this is stated:

APs will failover to other WLCs if the LWAPP control plane is interrupted

After either:

A missed heartbeat to WLC (sent every 30 seconds if no activity)

Or

A Non-ACK’d LWAPP control packet

Then:

The AP will send five successive heartbeats (each a second apart)

So five heartbeats now.

I think the number of heartbeat packets directly depends on the page-number of the document

This year I visited another techtorial at Cisco Live in London. If someone has the PDF slides from this year, could you please review the "Fast Failover" or "Backup Controller" slides and tell me what's stated there?

Thank you!

johannes.luther Wed, 08/15/2012 - 02:15

I did some testing and want to share the results - here's what really happens (Version 7.2.110.0 / local mode APs):

With "AP Fast Heartbeat Timeout" configured, the AP sends a heartbeat request to the WLC once per "AP Fast Heartbeat Timeout" interval during normal operation. If the WLC fails, the AP sends three retransmissions after not getting a heartbeat response from the WLC. The AP waits three seconds for a response after the request and after each retransmission. These three seconds are always the same - independent of the heartbeat timeout value!

Here's an illustration:

(Attached image: WLC-fast-heartbeat-final.JPG)

Example 1: AP Heartbeat Timeout = 30 seconds / Local Mode AP Fast Heartbeat Timeout = 10 seconds

(debug output from AP with "debug capwap client event")

*Aug 15 07:55:56.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:55:56.071: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 07:56:06.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:06.071: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 07:56:16.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:16.071: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

--> Disable WLC 192.0.2.1

*Aug 15 07:56:26.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:29.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 0 Max Re-Transmission Value=3

*Aug 15 07:56:29.071: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 07:56:32.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 1 Max Re-Transmission Value=3

*Aug 15 07:56:32.071: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 07:56:35.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 2 Max Re-Transmission Value=3

*Aug 15 07:56:35.071: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 07:56:36.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:38.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 3 Max Re-Transmission Value=3

*Aug 15 07:56:38.071: %CAPWAP-3-EVENTLOG: Max retransmission count exceeded going back to DISCOVER mode.

Time between WLC failure and AP going back to discovery mode: 12 - 22 seconds

If the AP has knowledge about a secondary WLC, the join phase takes approximately 3 seconds if the secondary WLC has the same image and config.
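As a rough cross-check of the numbers above, the elapsed time can be read straight from the debug timestamps. The following is only a minimal Python sketch: the helper names (`parse_timestamp`, `detection_seconds`) are made up for illustration, the year 2012 is assumed because the AP log prints none, and only a handful of lines from Example 1 are included. It measures from the first unanswered heartbeat request, so the real WLC failure may have happened up to one fast heartbeat interval (10 seconds here) earlier.

```python
from datetime import datetime

# Excerpt from the Example 1 debug output, starting at the first unanswered request.
LOG = """\
*Aug 15 07:56:26.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1
*Aug 15 07:56:29.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 0 Max Re-Transmission Value=3
*Aug 15 07:56:32.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 1 Max Re-Transmission Value=3
*Aug 15 07:56:35.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 2 Max Re-Transmission Value=3
*Aug 15 07:56:38.071: %CAPWAP-3-EVENTLOG: Max retransmission count exceeded going back to DISCOVER mode.
"""

ASSUMED_YEAR = "2012"  # assumption: the AP log lines carry no year


def parse_timestamp(line):
    """Turn '*Aug 15 07:56:26.071: %CAPWAP-...' into a datetime."""
    stamp = line.lstrip("*").split(": ", 1)[0]
    return datetime.strptime(ASSUMED_YEAR + " " + stamp, "%Y %b %d %H:%M:%S.%f")


def detection_seconds(log_text):
    """Seconds from the first heartbeat request in the excerpt to the DISCOVER event."""
    lines = [l for l in log_text.splitlines() if l.strip()]
    first_request = next(l for l in lines if "HeartBeat request sent" in l)
    discover = next(l for l in lines if "DISCOVER" in l)
    return (parse_timestamp(discover) - parse_timestamp(first_request)).total_seconds()


print(detection_seconds(LOG))  # 12.0
```

The 12 seconds match the lower end of the 12 - 22 second range; the upper end adds the 10 second fast heartbeat interval.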

Example 2: AP Heartbeat Timeout = 10 seconds / Local Mode AP Fast Heartbeat Timeout = 1 second

(debug output from AP with "debug capwap client event")

*Aug 15 08:40:49.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:49.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 08:40:50.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:50.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

--> Disable WLC 192.0.2.1

*Aug 15 08:40:52.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:53.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:54.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 0 Max Re-Transmission Value=3

*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:56.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:57.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:58.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 1 Max Re-Transmission Value=3

*Aug 15 08:40:58.323: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 08:40:58.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:59.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:00.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Echo Interval Expired.

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Echo Request sent to 192.0.2.1

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 2 Max Re-Transmission Value=3

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:02.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:03.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:04.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 3 Max Re-Transmission Value=3

*Aug 15 08:41:04.323: %CAPWAP-3-EVENTLOG: Max retransmission count exceeded going back to DISCOVER mode.

The fast heartbeat interval is shorter than the retransmit timeout, so three HB requests are sent between retransmissions.

So a shorter heartbeat interval only influences the initial failure detection - it does not change anything about the 3 second retransmit timeout.

Time between WLC failure and AP going back to discovery mode: ~13 seconds

If the AP has knowledge about a secondary WLC, the join phase takes approximately 3 seconds if the secondary WLC has the same image and config.
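The pattern seen in both debug outputs can be reproduced with a small simulation. This is a sketch in Python based only on the behavior observed above, not on any Cisco documentation: the function name `failover_timeline` and its parameters are hypothetical, whole seconds are assumed, and the regular echo request seen at 08:41:01 is not modeled.

```python
def failover_timeline(fast_hb, retransmit_interval, retransmit_count):
    """Return (second, event) tuples; t=0 is the first unanswered heartbeat request.

    Assumption taken from the debug output: the AP gives up one retransmit
    interval after the last retransmission, i.e. at interval * (count + 1).
    """
    deadline = retransmit_interval * (retransmit_count + 1)
    events = []
    # Retransmissions of the unanswered request, one per retransmit interval.
    for n in range(retransmit_count):
        events.append(((n + 1) * retransmit_interval, f"Retransmission Count= {n}"))
    # Fast heartbeats keep being sent on their own schedule until the AP gives up.
    for t in range(0, deadline, fast_hb):
        events.append((t, "HeartBeat request sent"))
    events.append((deadline, "Max retransmission count exceeded, back to DISCOVER"))
    # Stable sort keeps the retransmission before a heartbeat with the same timestamp.
    events.sort(key=lambda e: e[0])
    return events


# Example 1 settings: 10 s fast heartbeat, 3 s retransmit interval, 3 retransmits.
for second, event in failover_timeline(10, 3, 3):
    print(f"t+{second:>2}s  {event}")
```

With the Example 2 settings (`failover_timeline(1, 3, 3)`) the same function yields twelve heartbeat requests, three of them between each pair of retransmissions, which is what the second debug output shows.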

Best regards

Johannes

johannes.luther Wed, 08/15/2012 - 02:54

Sorry, I forgot something in my previous post. Starting with 7.2, it is possible to set the retransmit interval and count (in my example above, 3 seconds and 3 retries).

The system default is 3 second retransmit interval and 5 retries.

I observed the following behavior:

When using system defaults for retransmit interval and count (like in my previous example):

- 3 retries when using Fast-Heartbeats

- 5 retries (which is configured) when not using Fast-Heartbeats (just normal AP Heartbeat Timeout)

When manually setting the retransmit interval and count (e.g. 5 seconds / 5 retries):

- 5 retries when using Fast-Heartbeats (if set to 5 retries)

- 5 retries when not using Fast-Heartbeats (just normal AP Heartbeat Timeout) (if set to 5 retries)

Be careful - according to the 7.2 documentation, the retransmit parameters do not apply only to HB messages. They apply to all CAPWAP related communication (request / response).

So here's an addition to my drawing:

The minimum values for the retransmit parameters are 3 retransmits with 2 second timeout.

So the time between WLC failure and the AP going back to discovery mode, when using fast heartbeats, is:

Worst case: "AP Fast Heartbeat timeout" + ("Retransmit interval" * ("Retransmit count"+1))

Best case: "Retransmit interval" * ("Retransmit count"+1)

So the WLC failure detection can be tuned to ~9 seconds in total (don't ask me if this is recommended or not :-) )

- AP Fast Heartbeat Timeout: 1 second

- Retransmit interval: 2 seconds

- Retransmit count: 3

--> 1 + (2 * (3+1)) = 9 seconds
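The best-case and worst-case formulas above, written as a small Python helper. This is only a sketch of the arithmetic in this post; `detection_window` and its parameter names are hypothetical labels, not WLC CLI keywords, and all values are in seconds.

```python
def detection_window(fast_hb_timeout, retransmit_interval, retransmit_count):
    """Return (best_case, worst_case) seconds between WLC failure and AP discovery."""
    # Best case: the WLC fails right before a heartbeat request is due.
    best = retransmit_interval * (retransmit_count + 1)
    # Worst case: the WLC fails right after answering, so one full
    # fast heartbeat interval passes before the first unanswered request.
    worst = fast_hb_timeout + best
    return best, worst


print(detection_window(10, 3, 3))  # (12, 22) -> the range measured in Example 1
print(detection_window(1, 2, 3))   # (8, 9)   -> the minimum settings listed above
```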

Posted July 19, 2012 at 9:28 AM