WLC heartbeat and failure detection time

Jul 19th, 2012

Hi board,

I have a question regarding the failure detection time for an AP, if the corresponding WLC becomes unavailable.

We're talking about the current 7.2 release here - so no "old" technotes (<5.0) please :-))

I don't completely understand the software configuration guide, the available technotes and some Cisco Live Slides regarding this topic.

Also, the current software configuration guide (7.2) doesn't explain the new fields "AP Retransmit Count" and "AP Retransmit Interval". It only states that these values can be configured, not what they do or what impact they have.

So here's how I understand the functionality:

WLC_heartbeat.JPG

In this example I used the following values:

  • AP heartbeat timeout: 3 seconds
  • Local mode AP fast heartbeat: 1 second
  • AP primary discovery timeout: 30 seconds (but this is not important in this context, I guess)

The AP probes the WLC every "AP heartbeat timeout" (3 seconds) with an echo request.

First question: How long does the AP wait for a response from the WLC? If I understand it correctly, the value "heartbeat timeout" is more or less something like the "hello interval" (HSRP, EIGRP, OSPF).... but what's the dead time here?

When the WLC doesn't answer the echo request, the AP starts sending fast heartbeat messages (every 1 second, "Local mode AP fast heartbeat").

Second question: What is the dead interval for those fast messages?

After three failed fast heartbeats (at least I think it's three of them .... there's no document I know of stating how many fast heartbeats need to fail), the AP switches to the backup controller.

What I want to achieve is something like a simple convergence time calculation, like everyone knows from routing, STP and so on :-)

- What is the failure detection time? (heartbeat timeout + <x> times fast heartbeat + .... I don't know)

- After the failure detection time, how much time is needed to change to the 2nd controller - performing a join, configuration and run state (assuming same config, same SW)? I know - this is kinda hard to answer.... so I'd already be glad about an answer to the first question ("failure detection time").

Perhaps we can develop this together and create some kind of document here - I think a good document explaining this is needed.

At least I'm not aware of any document explaining this without leaving some open questions. I read some sections of RFC5415 (CAPWAP) as well - but it did not explain all of it :-)

So I would be really grateful for some input regarding this topic.

Best regards

Joe

Saravanan Lakshmanan Sat, 07/21/2012 - 15:11

http://www.cisco.com/en/US/tech/tk722/tk809/technologies_configuration_example09186a008064a294.shtml

verify:

You can verify if the configuration works as expected. Power down the primary controller to which the AP is currently registered. The AP waits for the heartbeat time set, which is 30 seconds by default, to detect the failure of the primary WLC. After this period of time, the AP sends heartbeat messages seven more times, one per second, in efforts to find the primary WLC. If the AP does not hear from the primary WLC, the AP registers to an available WLC via the default process. Therefore, the process to detect the primary WLC failure and register to the secondary WLC takes approximately 80 seconds. Once the access point joins the secondary controller, it continues to send the discovery request to the primary controller in order to determine if the primary controller is back in operation. This can be determined with the help of the debug lwapp client packet command.

Note: The heartbeat message is similar to a keepalive message. The AP heartbeat is set to 30 seconds by default. You can adjust this heartbeat time, down to 1 second. However, if you have not made this adjustment since the last time that the AP heard from the WLC, 30 seconds pass before the AP realizes that it cannot reach the WLC.

http://www.cisco.com/en/US/products/ps6366/products_tech_note09186a00809a3f5d.shtml

Johannes Luther Mon, 07/23/2012 - 01:07

Thank you for your reply. I also found this topic (https://supportforums.cisco.com/message/3193016#3193016) from 2010 which you quoted in your post.

I doubt it's still valid today with the current 7 release.

Review the Cisco Live Breakout Session "Design and Deployment of Enterprise WLANs" (BRKEWN-2010). It mentions three missed heartbeats instead of seven. I also think the 80 seconds mentioned are no longer correct in the current releases - even with default timers.

I guess the new configuration parameters "AP retransmit count" and "AP retransmit interval" play a role in this process as well.

So back to start - any ideas, links to documents or best-practices?

I will test this in the lab - but I don't have the equipment yet. I'm still in the conceptual phase and I want to understand the functionality completely :-) Sorry about being so annoying about that.

Johannes Luther Tue, 07/24/2012 - 02:28

Just to follow up on this one. Here's a statement about this from the Cisco Live techtorial from 2009

In 5.0 controller code a new fast heartbeat request & response mechanism was created so that a controller failure can be detected within 4 seconds (1 second timeout + 3 retries)

So three retries in the 5 release.

In the same presentation (chapter AP troubleshooting) this is stated:

APs will failover to other WLCs if the LWAPP control plane is interrupted

After either:

A missed heartbeat to WLC (sent every 30 seconds if no activity)

Or

A Non-ACK’d LWAPP control packet

Then:

The AP will send five successive heartbeats (each a second apart)

So five heartbeats now.

I think the number of heartbeat packets directly depends on the page-number of the document

This year I visited another techtorial at Cisco Live in London. If someone has the PDF slides from this year, could you please review the "Fast Failover" or "Backup Controller" slides and tell me what's stated there?

Thank you!

Johannes Luther Wed, 08/15/2012 - 02:15

I did some testing and want to share the results - here's what really happens (Version 7.2.110.0 / local mode APs):

When the "AP Fast Heartbeat Timeout" is configured, the AP sends a heartbeat request to the WLC once per "AP Fast Heartbeat Timeout" interval during normal operation. If the WLC fails and the AP gets no heartbeat response, it sends three retransmissions. After each transmission the AP waits three seconds for a response. These three seconds are always the same - independent of the heartbeat timeout value!

Here's an illustration:

WLC-fast-heartbeat-final.JPG

Example 1: AP Heartbeat Timeout = 30 seconds / Local Mode AP Fast Heartbeat Timeout = 10 seconds

(debug output from AP with "debug capwap client event")

*Aug 15 07:55:56.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:55:56.071: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 07:56:06.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:06.071: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 07:56:16.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:16.071: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

--> Disable WLC 192.0.2.1

*Aug 15 07:56:26.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:29.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 0 Max Re-Transmission Value=3

*Aug 15 07:56:29.071: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 07:56:32.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 1 Max Re-Transmission Value=3

*Aug 15 07:56:32.071: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 07:56:35.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 2 Max Re-Transmission Value=3

*Aug 15 07:56:35.071: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 07:56:36.071: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 07:56:38.071: %CAPWAP-3-EVENTLOG: Retransmission Count= 3 Max Re-Transmission Value=3

*Aug 15 07:56:38.071: %CAPWAP-3-EVENTLOG: Max retransmission count exceeded going back to DISCOVER mode.

Time between WLC failure and AP going back to discovery mode: 12 - 22 seconds

If the AP knows about a secondary WLC, the join phase takes approximately 3 seconds, provided the secondary WLC has the same image and config.
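To make the timing in Example 1 easier to follow, here is a small Python sketch that rebuilds the retransmit schedule seen in the debug output. It is only a model of what the log shows, not of how the AP is implemented: the function name and its parameters are made up for illustration, and the times are plain seconds past 07:56:00.

```python
def retransmit_timeline(first_unanswered, retransmit_interval=3, max_retransmits=3):
    """Return (second, event) tuples for one unanswered heartbeat.

    Hypothetical helper: models the pattern in the debug output above,
    where the AP logs "Retransmission Count= n" every retransmit_interval
    seconds and gives up once n reaches max_retransmits.
    """
    events = [(first_unanswered, "HeartBeat request sent (no response)")]
    for count in range(max_retransmits + 1):
        when = first_unanswered + retransmit_interval * (count + 1)
        if count < max_retransmits:
            events.append((when, f"Retransmission Count= {count}, packet resent"))
        else:
            events.append((when, f"Retransmission Count= {count}, back to DISCOVER"))
    return events


# Example 1: the first unanswered request left the AP at 07:56:26.
for second, event in retransmit_timeline(26):
    print(f"07:56:{second:02d}  {event}")
```

The schedule ends at 07:56:38, i.e. 12 seconds after the first unanswered request. Since the WLC can fail at any point within the 10 second fast heartbeat interval before that request, the total lands between 12 and 22 seconds. The extra heartbeat request at 07:56:36 in the real log is not modeled here.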

Example 2: AP Heartbeat Timeout = 10 seconds / Local Mode AP Fast Heartbeat Timeout = 1 second

(debug output from AP with "debug capwap client event")

*Aug 15 08:40:49.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:49.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 08:40:50.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:50.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

*Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1

--> Disable WLC 192.0.2.1

*Aug 15 08:40:52.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:53.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:54.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 0 Max Re-Transmission Value=3

*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:56.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:57.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:58.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 1 Max Re-Transmission Value=3

*Aug 15 08:40:58.323: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 08:40:58.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:40:59.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:00.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Echo Interval Expired.

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Echo Request sent to 192.0.2.1

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 2 Max Re-Transmission Value=3

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: Sending packet to AC

*Aug 15 08:41:01.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:02.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:03.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1

*Aug 15 08:41:04.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 3 Max Re-Transmission Value=3

*Aug 15 08:41:04.323: %CAPWAP-3-EVENTLOG: Max retransmission count exceeded going back to DISCOVER mode.

The fast heartbeat interval is shorter than the retransmit timeout, so three HB requests are sent between retransmissions.

So a shorter heartbeat interval just influences the initial failure detection - it does not change anything about the 3 second retransmit timeout.

Time between WLC failure and AP going back to discovery mode: ~13 seconds

If the AP knows about a secondary WLC, the join phase takes approximately 3 seconds, provided the secondary WLC has the same image and config.
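If you want to get these numbers out of your own capture, a rough Python sketch like the following may help. It assumes the log lines look exactly like the "debug capwap client event" excerpts above; the function names and the year parameter are my own (the AP timestamps carry no year), so adjust as needed.

```python
import re
from datetime import datetime

# Matches lines such as:
# *Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1
LINE_RE = re.compile(r"^\*(\w{3} +\d+ \d\d:\d\d:\d\d\.\d+): %CAPWAP-3-EVENTLOG: (.*)$")


def parse_line(line, year=2012):
    """Return (timestamp, message), or None if the line is not an event log line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, message = match.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
    return when, message


def detection_bounds_from_log(lines, year=2012):
    """Return (min_seconds, max_seconds) between WLC failure and DISCOVER.

    The WLC failed somewhere between the last answered heartbeat and the
    first unanswered one, so the log only gives a lower and an upper bound.
    """
    last_response = first_unanswered = discover = None
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue  # blank lines, the "--> Disable WLC" marker, etc.
        when, message = parsed
        if message.startswith("HeartBeat response"):
            last_response, first_unanswered = when, None
        elif message.startswith("HeartBeat request") and first_unanswered is None:
            first_unanswered = when
        elif "going back to DISCOVER" in message:
            discover = when
            break
    if last_response is None or first_unanswered is None or discover is None:
        return None
    return ((discover - first_unanswered).total_seconds(),
            (discover - last_response).total_seconds())


sample = """\
*Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1
*Aug 15 08:40:51.323: %CAPWAP-3-EVENTLOG: HeartBeat response from 192.0.2.1
*Aug 15 08:40:52.323: %CAPWAP-3-EVENTLOG: HeartBeat request sent to 192.0.2.1
*Aug 15 08:40:55.323: %CAPWAP-3-EVENTLOG: Retransmission Count= 0 Max Re-Transmission Value=3
*Aug 15 08:41:04.323: %CAPWAP-3-EVENTLOG: Max retransmission count exceeded going back to DISCOVER mode.
""".splitlines()

print(detection_bounds_from_log(sample))  # (12.0, 13.0)
```

For the shortened Example 2 excerpt this gives 12 to 13 seconds, which fits the ~13 seconds stated above.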

Best regards

Johannes

Johannes Luther Wed, 08/15/2012 - 02:54

Sorry, I forgot something in my previous post. Beginning from 7.2, it is possible to set this retransmit interval and count (in my example above 3 seconds and 3 retries).

The system default is 3 second retransmit interval and 5 retries.

I observed the following behavior:

When using system defaults for retransmit interval and count (like in my previous example):

- 3 retries when using Fast-Heartbeats

- 5 retries (which is configured) when not using Fast-Heartbeats (just normal AP Heartbeat Timeout)

When manually setting the retransmit interval and count (e.g. 5 seconds / 5 retries):

- 5 retries when using Fast-Heartbeats (if set to 5 retries)

- 5 retries when not using Fast-Heartbeats (just normal AP Heartbeat Timeout) (if set to 5 retries)

Be careful - according to the 7.2 documentation, the retransmit parameters are not just for HB messages. They apply to all CAPWAP related communication (request / response).

So here's an addition to my drawing:

The minimum values for the retransmit parameters are 3 retransmits with 2 second timeout.

So the time between WLC failure and AP discovery - when using fast heartbeats is:

Worst case: "AP Fast Heartbeat timeout" + ("Retransmit interval" * ("Retransmit count"+1))

Best case: "Retransmit interval" * ("Retransmit count"+1)

So the WLC failure detection can be tuned to ~9 seconds in total (don't ask me if that's recommended or not :-) )

- AP Fast Heartbeat Timeout: 1 second

- Retransmit interval: 2 seconds

- Retransmit count: 3

--> 1 + (2 * (3+1)) = 9 seconds
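The best case / worst case formula above can be put into a few lines of Python for trying out other timer values. This is just the formula from this post, which is based on my lab observations with 7.2.110.0. The function name is made up, and it is not an official Cisco calculation.

```python
def failure_detection_bounds(fast_heartbeat, retransmit_interval, retransmit_count):
    """Return (best_case, worst_case) in seconds until the AP goes back to discovery.

    Best case:  retransmit_interval * (retransmit_count + 1)
    Worst case: fast_heartbeat + retransmit_interval * (retransmit_count + 1)
    """
    retransmit_phase = retransmit_interval * (retransmit_count + 1)
    return retransmit_phase, fast_heartbeat + retransmit_phase


# Minimum retransmit values with a 1 second fast heartbeat
print(failure_detection_bounds(1, 2, 3))   # (8, 9)

# Example 1 from the earlier post: 10 second fast heartbeat, 3 s / 3 retries
print(failure_detection_bounds(10, 3, 3))  # (12, 22)
```

Keep in mind that this only covers the failure detection. The join to the secondary WLC (about 3 seconds in my tests) comes on top.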

Arie - Wed, 11/11/2015 - 05:24

Hello Johannes,

Thanks for your explanation and for sharing. It's very helpful for understanding how the AP looks for the other controller after the primary is down.

But I'm still confused about the difference between AP Heartbeat Timeout and AP Fast-Heartbeats. From your lab result, I couldn't see where the AP Heartbeat Timeout takes effect. I only saw the AP Fast-Heartbeats taking place.

Maybe you can tell me what the difference between them is, both when I enable them together and when I don't.

Thank you.

Arie

Johannes Luther Thu, 11/12/2015 - 22:08

Hi Arie,

wow - you were digging out a really old post :)

So right now it's only guessing but:

You can configure the fast heartbeat timer only for access points in local and FlexConnect modes.

From the SW configuration guide.

So I guess the normal HB is taking place in all other operating modes (e.g. bridge).

Also the Fast HB is not a mandatory setting. So if you don't configure it, it's not used.

Arie - Fri, 11/13/2015 - 03:54

Hi Johannes,

I was just googling and found this post, which I hoped would help me understand the heartbeat. :)

Okay, I understand that Fast HB is optional. But under what conditions should I set Fast HB?

From your lab result, when you set Fast HB, I can see that the HB request/response uses the Fast HB time. What about the normal HB? Does the normal HB not take effect because you enabled Fast HB?

Thank you.

Arie

Johannes Luther Mon, 12/07/2015 - 06:09

Correct (at least I guess so). As soon as you configure Fast HB, the normal HB is no longer used.

At least I saw this in my test results 3 years (wow - time flies when you're having fun) ago. I guess it was a 7.2 release. Don't know if it's still valid for newer releases.

With SSO, the failure detection time and the heartbeats between WLC and AP play a less important role for convergence.
