ASA Dead Peer Detection - implementing a resilient solution for a critical remote site

mitchen · ‎02-17-2010

Hi,

I wonder if anyone can help.

Our remote sites typically have an IPSEC VPN connection terminating on our head office ASA.

At one of the more critical remote sites, we are trying to implement some resilience for them to protect against a circuit failure. So, they have 2 Cisco 1841 routers, one connected to an ADSL line (secondary), one connected to a fixed rate 10Mbps circuit (primary).

I have configured the routers as an HSRP pair (and am running EIGRP between them) so that if the primary router or 10Mbps circuit fails, the secondary router takes over and traffic will flow from their LAN to the secondary router and then use the ADSL line to build the IPSEC tunnel to head office.

On the head office ASA, I have simply configured the 2 corresponding remote peers.

Now, I think as far as the remote site is concerned, it's working as expected.

However, earlier in the week we noticed that the site was complaining that things were running slow. When I checked it out, it seemed that the ASA had actually built a tunnel to the secondary router at the remote site so the ADSL line was being used rather than the 10Mbps circuit. There hadn't been any problems with the primary router or circuit and, indeed, the primary router was still active in the HSRP pair at the remote site.

Is there any way I can configure the dead peer detection on the ASA to favour one peer over the other to prevent this happening? (I have the primary peer listed first)

For the time being, I've simply removed the secondary peer from the ASA altogether so it will only establish an IPSEC tunnel with the primary remote router, but this obviously means my automatic resilience plans for the remote site have been thwarted too!

Can anyone advise on how to set this up as desired? i.e. so that the primary circuit will be used at all times unless there is a failure of the remote site's primary router or circuit (and then I want to automatically go BACK to the primary comms again once that problem has been fixed)

NOTE: unfortunately, as is all too often the case with these things, this has been implemented on a production network and my opportunities to test are limited.

Thanks for any advice/suggestions you can give.

andrew.prince · ‎02-17-2010

Something must have happened with the HSRP or EIGRP - check your logs.

What I see as a simple solution would be:-

1) Configure GRE tunnels between the Primary routers and the secondary routers.

2) Run EIGRP over the tunnels

3) Configure the 10Mbps circuit VPN tunnel with a low EIGRP delay and high BW

4) Configure the ADSL circuit VPN tunnel with a high EIGRP delay and low BW

5) Configure the EIGRP timers to 1 sec hello 3 sec dead

6) Configure tunnel keepalives to 1 sec hello 3 sec dead

You will automatically fail over and back within 3 seconds of a circuit/VPN/router/site failure.
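To illustrate why steps 3 and 4 steer traffic onto the 10Mbps path, here is a minimal sketch of the classic EIGRP composite metric with default K values (only bandwidth and delay count). The bandwidth and delay figures are made-up illustration values, not taken from any real configuration, and the function name is hypothetical.

```python
def eigrp_metric(min_bandwidth_kbps: int, total_delay_usec: int) -> int:
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1)."""
    bandwidth_term = 10_000_000 // min_bandwidth_kbps  # inverse of the slowest link
    delay_term = total_delay_usec // 10                # delay in tens of microseconds
    return 256 * (bandwidth_term + delay_term)


# Assumed tunnel settings, for illustration only.
primary = eigrp_metric(min_bandwidth_kbps=10_000, total_delay_usec=1_000)
secondary = eigrp_metric(min_bandwidth_kbps=1_000, total_delay_usec=50_000)

print(primary, secondary)  # 281600 3840000
```

EIGRP installs the route with the lowest metric, so with values like these the primary tunnel is used whenever it is up and the ADSL tunnel only takes over when the primary route disappears.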

HTH>

mitchen · ‎02-18-2010

Thanks for the help. I think the current solution I have in place is along the lines of what you suggest. However, I'm not using GRE tunnels currently, and I should point out that EIGRP is only running locally at the remote site; we do not have EIGRP employed at all at our head office.

However, I think I've sussed out why it went wrong (preventing similar from happening again may be another matter!)

Our head office internet pipe is provided by ISP1 and the ADSL circuit at the remote site is also provided by ISP1.

The 10Mbps circuit at the remote site is provided by ISP2.

So, what I think has happened is that (for only a few minutes) ISP1 lost its external internet connectivity (long and complicated story, but this kind of thing has happened a few times in the past, which is why we are using ISP2 for this new connection!). The ASA dead peer detection kicked in, saw the ISP2 peer as down and formed a tunnel to the ISP1 peer (the ADSL router). Since the ADSL router uses the same ISP as the head office internet pipe, it wasn't affected by ISP1's "internet" outage.

BUT, at remote site, nothing is actually wrong - HSRP and EIGRP status is all fine as far as it's concerned.

So, the question remains really - is there any way of specifying a preferred peer on the ASA? When the ASA does fail over to the 2nd peer, can I have it automatically go back to the "primary" peer when comms are restored to it?

I think my current solution "works" other than in the case where there is a problem in the "cloud" so to speak. Unfortunately, the reliability of ISP1 is such that this is probably the more likely fault scenario! Is there anything that would help me overcome this sort of issue?

Thanks.

andrew.prince · ‎02-18-2010

Sorry, I am confused by what you have said regarding your setup and what you think has gone wrong.

Are you monitoring any HSRP interfaces? Are you running any dynamic routing protocols over the VPN?

Why would the primary router choose to use the secondary router to reach the head office?

mitchen · ‎02-18-2010

Apologies, I realise I was rambling a bit – it’s quite a difficult set-up to try to explain concisely though!

Head office – main internet pipe, large number of IPSEC VPN connections to a large number of remote sites, typically with ADSL connections. The IPSEC VPN tunnels terminate on our ASA at head office. For the vast majority of our sites, this set-up is fine and we can “live” with any problems that may arise from a site having a router or circuit failure or whatever.

However, one particular site is more critical than the others. Hence we want to ensure connectivity with this office and introduce some added resilience for it.

This remote office has 2 routers – Cisco 1841s – one is “primary” with a 10Mbps circuit connected (via ISP2) and one is “secondary” with an ADSL circuit connected (via ISP1 – the same ISP used for our main head office circuit). They are running HSRP and EIGRP between them (running locally only, not across the VPN tunnels) so that if the primary router or 10Mbps circuit should fail, the secondary router will take over as active and traffic should flow across the ADSL connection. From admittedly limited testing, the resilient aspect of this part of the solution seems to work.

Back at head office, on the ASA, I have configured 2 peers in the VPN set-up for this critical remote office. One peer is the IP address of the primary remote router (ISP2) and the other peer is the IP address of the secondary router (ISP1). The thought behind that being, if the 10Mbps circuit or primary router itself was to fail then obviously that peer would be unavailable and the ASA’s dead peer detection would make it move onto the other peer.

So, normal operation - all traffic from the remote site should flow via its primary router over the 10Mbps circuit to head office. Only if there is a problem with the primary router or 10Mbps circuit should traffic flow via the secondary router and ADSL line instead.

However, in this situation, what has happened is – there was NO fault with the 10Mbps circuit or router so locally everything at the remote site was fine. BUT, ISP1 has had some sort of “internet outage” meaning that their connectivity with other networks has been lost for a few minutes. The impact of this was that the Head Office ASA could no longer “see” the primary router at the remote site so, because of dead peer detection, it moved onto the secondary peer (the ADSL router) which it could still “see” because it uses the same ISP (connectivity internal to the ISP was ok, only connectivity external to the ISP was affected at that time)

So, even though there was no issue at the remote office, it’s the head office ASA that’s decided to use the “secondary” peer to establish the tunnel to (and it continued to use that secondary peer even when comms were restored to the primary i.e. when the ISP's "internet outage" was over) What I want to know is whether there is a way for the ASA to “prefer” one peer over the other and, in the event of a failure, whether I can have the ASA automatically fall back to the "primary" peer once comms are re-established to it? (Or any other way round this?)

Does that make more sense? Sorry, it’s quite difficult to explain!

Ps just in case anyone is thinking “why are you bothering with resilience for the remote office, what about your head office?” we do already have resilience for our main head office circuits too!

andrew.prince · ‎02-20-2010

What I am having trouble with is this:-

The routers are running EIGRP and HSRP - fine

They have VPN tunnels - with NO dynamic routing running.

So if the VPN tunnel from router 1 is OK, but the ASA in the head office decides it's dead and creates a VPN connection to router 2, where is the logical failover between routers 1 and 2? How do they decide between them that router 1 is still the primary local router, yet it passes off traffic to the head office via router 2? This makes no sense to me.

Unless you either have some monitoring or dynamic failover routing?

mitchen · ‎02-23-2010

Yes, that’s what I don’t understand either. But that’s what seems to have happened. (As is typical with these kinds of things, when the incident occurred, I was concentrating more on getting service back up and running again properly than on finding out what had actually happened. What seemed to be happening was that the tunnel between the secondary remote router and the ASA was established and stayed up, whereas it looked like the primary router was trying to build a tunnel to the ASA again but it was constantly being torn down each time.) Like you, I’m confused as to why/how the traffic from the remote site could still be directed via the secondary router when there were no HSRP or EIGRP changes at that end. (Having said that, the site was complaining of “slowness/connectivity” issues, so it may have been that, despite the tunnel being “up” on the secondary router, traffic was not actually flowing properly through it from the remote site.)

Maybe I’ll rephrase my original question slightly though:

· In a set-up like ours, what should happen if the ASA detects the 1st peer in its list as down and moves onto the 2nd peer?

· When the 1st peer becomes available again, what should happen from the ASA perspective?

I still get the feeling (or maybe it’s just clutching at straws!) that there is something glaringly simple and obvious that I’m missing which is at the root of all this. However, I’m trying to organise a maintenance window at some stage so that I can look into it in more depth, as there are too many question marks over what actually occurred at present.

Thanks for your advice (and patience!) so far.

Message was edited by: mitchen - corrected typo - should have been "when the 1st peer becomes available again" NOT "when the 2nd peer becomes available again"

andrew.prince · ‎02-23-2010

I think you need to have a really good look at your logs for the event - to see what went on.

Your questions:-

· In a set-up like ours, what should happen if the ASA detects the 1st peer in its list as down and moves onto the 2nd peer? Yes

· When the 2nd peer becomes available again, what should happen from the ASA perspective? 1st peer, you mean. If the 1st has failed, the ASA will fail over to the 2nd. If the 1st is OK again, nothing will happen; the 2nd will have to fail for the 1st to take over.

HTH>
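The behaviour described above (no fall-back to the 1st peer until the 2nd one fails) can be sketched as a toy model. This is only an illustration of the logic as stated in this thread, not the ASA's actual implementation; the class, the peer names and the timeline are all hypothetical.

```python
class PeerList:
    """Toy model of an ordered, non-preemptive peer list."""

    def __init__(self, peers):
        self.peers = list(peers)
        self.active = 0  # index of the peer the tunnel is currently built to

    def update(self, reachable):
        """Return the peer in use, given the set of peers answering keepalives."""
        if self.peers[self.active] in reachable:
            return self.peers[self.active]  # current peer is alive: stay put
        # Current peer is dead: try the others in list order.
        # (Wrap-around order is an assumption; with two peers it makes no difference.)
        count = len(self.peers)
        for step in range(1, count):
            candidate = (self.active + step) % count
            if self.peers[candidate] in reachable:
                self.active = candidate
                break
        return self.peers[self.active]  # unchanged if no peer is reachable


asa = PeerList(["primary-10M", "secondary-ADSL"])
timeline = [
    {"primary-10M", "secondary-ADSL"},  # normal operation
    {"secondary-ADSL"},                 # outage hides the primary peer
    {"primary-10M", "secondary-ADSL"},  # primary reachable again
    {"primary-10M"},                    # secondary fails
]
history = [asa.update(reachable) for reachable in timeline]
print(history)
# ['primary-10M', 'secondary-ADSL', 'secondary-ADSL', 'primary-10M']
```

The third step is the interesting one: the primary is back, but the model stays on the secondary because nothing forces it to move.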

mitchen · ‎02-23-2010

Thanks (and yes, sorry, that was a typo above, I meant the "1st peer"). So, forgetting about the initial problem just for a moment and trying to clarify exactly what should happen in this set-up...

If 1st peer is marked as failed, ASA builds tunnel to 2nd peer - this means, in a situation where there is a problem with the remote site's primary router or circuit, the solution will work in that the ASA will no longer be able to detect the 1st peer so will failover to the 2nd peer and HSRP/EIGRP at the remote site will also failover to the secondary router so comms will be established successfully via the slower ADSL link.

However, the problem with my set-up will be when the primary router or circuit returns to service again (and HSRP/EIGRP at the remote site updates accordingly). The remote site will be trying to send traffic via the primary router again while the ASA will still be trying to use the 2nd peer. Is that correct? So what would happen in this scenario? Presumably there will be a conflict of some sort and that would affect the comms between the remote site and head office?

And, what I'm trying to establish is - is there any way round this from the ASA perspective? i.e. is there any way I could get it to recognise that when the 1st peer becomes available again it should prefer it over the 2nd peer? If not, is there a simple manual workaround? e.g. could I simply clear the secondary peer SA using "clear crypto ipsec sa peer w.x.y.z" from the ASA CLI to have it rebuild the tunnel using the 1st peer again (rather than having to remove the 2nd peer from the config altogether to force it back over to the 1st peer)?