09-15-2010 12:09 PM
I have a Cisco 1811 router with 2 ISP connections coming into it for VPN redundancy. I've noticed over the last few weeks that the SLA which tracks the default gateway to the primary ISP has been failing, even though it should be returning an OK status. The SLA stays up for a few runs, then reports No Connection; that makes the backup ISP the preferred default route, the VPN endpoint comes up, and I'm left with 2 QM_IDLE sessions on the router.
Here is a bit of the config:
crypto logging session
!
crypto isakmp policy 10
encr 3des
hash md5
authentication pre-share
group 5
crypto isakmp key MYKEY address 192.168.1.1
!
crypto ipsec security-association lifetime seconds 86400
!
crypto ipsec transform-set tunnelset esp-3des esp-md5-hmac
no crypto ipsec nat-transparency udp-encaps
!
crypto map VPNMAP 10 ipsec-isakmp
set peer 192.168.1.1
set transform-set tunnelset
match address 101
!
!
track 10 ip sla 10 reachability
delay down 10 up 30
!
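For completeness, the ip sla 10 definition that track 10 references isn't shown above. A minimal sketch of what it presumably looks like follows; the probe target, source interface name, and timers here are assumptions, not the original config:

```
! Sketch only: target, interface name, and timers are assumptions
ip sla 10
 icmp-echo 192.168.1.1 source-interface FastEthernet0
 frequency 10
 timeout 5000
ip sla schedule 10 life forever start-time now
!
! Tracked default route toward the primary ISP (next hop assumed)
ip route 0.0.0.0 0.0.0.0 10.1.1.1 track 10
```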
09-15-2010 04:16 PM
Hello,
Instead of tracking connectivity to 192.168.1.1, can you track the connectivity to 10.1.1.1 (ISP next hop)? If the ICMP packet gets dropped along the path, then it could create a problem. Setting the tracking to the next hop ensures that as long as the ISP is up, the VPN tunnel through that interface stays up.
Regards,
NT
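The change suggested above could be sketched like this, assuming the WAN interface name and probe frequency (neither appears in the thread):

```
! Sketch only: probe the ISP next hop rather than the remote VPN peer.
! Interface name and frequency are assumptions.
ip sla 10
 icmp-echo 10.1.1.1 source-interface FastEthernet0
 frequency 10
ip sla schedule 10 life forever start-time now
```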
09-16-2010 06:45 AM
Thanks for the response. I originally thought about doing that, but the issue there is that the next-hop IP address will pretty much always be up. The next hop is an ISP Ethernet handoff, so connectivity to that address will be up 99% of the time (unless that ISP link dies completely).
Sorry I didn't illustrate that in my earlier post. Generally, it's the 1811 (10.1.1.2) -> (10.1.1.1) ISP gateway router -> INTERNET -> My ISP Gateway devices -> My ASA cluster (192.168.1.1)
More than anything, I was curious why the SLA seemed to just stop working altogether; I think that would have happened regardless, even if I had been tracking the next hop.
Thanks
12-01-2010 10:59 AM
I have a similar problem on a couple of C1812 with IOS 12.4. I have deployed them with dual ISP configurations and route tracking.
Their configurations are similar to this:
! With tracking where ISP1's router is at 10.10.10.1, and ISP2's router is at 20.20.20.1, both have the default AD of 1
ip route 0.0.0.0 0.0.0.0 10.10.10.1 track 100
ip route 0.0.0.0 0.0.0.0 20.20.20.1 track 200
! Floating default routes with an AD of 200. Without these, the IP SLA probes failed and the tracked routes never came up. Once these were in the routing table, the tracked routes would come up and replace them, since they have a lower AD.
ip route 0.0.0.0 0.0.0.0 10.10.10.1 200
ip route 0.0.0.0 0.0.0.0 20.20.20.1 200
Interface Fa0 is set up with IP 10.10.10.2/24
Interface Fa1 is set up with IP 20.20.20.2/24
ip sla 100
 icmp-echo 4.2.2.2 source-interface FastEthernet0
ip sla schedule 100 life forever start-time now
ip sla 200
 icmp-echo 4.2.2.2 source-interface FastEthernet1
ip sla schedule 200 life forever start-time now
track 100 ip sla 100 reachability
 delay up 5 down 5
track 200 ip sla 200 reachability
 delay up 5 down 5
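When the tracked routes don't come up, it helps to check the probe and track state directly with the standard show commands:

```
show ip sla statistics 100    ! probe return code and RTT
show track 100                ! tracked object state and last change
show ip route 0.0.0.0         ! which default route is currently installed
```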
On one of the routers, that configuration works as expected:
Initially you have the following entries in the routing tables:
0.0.0.0/0 [200/0] 10.10.10.1
* [200/0] 20.20.20.1
Then IP SLA jobs 100 and 200 try pinging 4.2.2.2 using fa0 and fa1 respectively as the source. Both jobs succeed and bring up tracking objects 100 and 200, which in turn brings up the tracked routes. Since the tracked default routes have an AD of 1, they replace the default routes with an AD of 200, and the routing table shows:
0.0.0.0/0 [1/0] 10.10.10.1
* [1/0] 20.20.20.1
On the other router (different location, different ISPs, but I use the same IPs for the sake of simplicity) the configuration is similar, but the behavior is different. The routing table initially shows:
0.0.0.0/0 [200/0] 10.10.10.1
* [200/0] 20.20.20.1
The IP SLA 100 and 200 jobs never come up, because the ping tests to 4.2.2.2 sourced from fa0 or fa1 fail. But if I change the source to an internal VLAN, pings to 4.2.2.2 work just fine.
Note that there's no ZBF or access-list setup that could block traffic.
Now, let's say I shut down fa0. IP SLA 200 can suddenly ping 4.2.2.2 from fa1 using the route entry 0.0.0.0/0 20.20.20.1 200, and the tracked route 0.0.0.0/0 20.20.20.1 track 200 comes up and replaces 0.0.0.0/0 20.20.20.1 200 in the routing table. If I then run a no shut on fa0, nothing changes.
So the routing table looks like this:
0.0.0.0/0 [1/0] 20.20.20.1
I'm not sure what causes this behavior, but I guess it has something to do with the load-balancing algorithm.
The only way I can get fa0 and fa1 to load balance again is by removing the tracked routes, or perhaps by explicitly routing some traffic to the other interface using PBR.
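A PBR sketch along those lines might look like this; the inside subnet 192.168.10.0/24 and the LAN interface Vlan1 are hypothetical, since neither appears in the thread:

```
! Sketch only: steer part of the LAN traffic out ISP2 with PBR.
! Subnet and LAN interface name are assumptions.
access-list 150 permit ip 192.168.10.0 0.0.0.255 any
!
route-map TO-ISP2 permit 10
 match ip address 150
 set ip next-hop 20.20.20.1
!
interface Vlan1
 ip policy route-map TO-ISP2
```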
Does anyone know why one router would behave differently from the other?
Thanks.
Rado
12-01-2010 11:14 AM
I've been running this on some client 1811 routers with success. Here is a basic config that should work:
!! I use a delay on the up/down to prevent some flapping issues that could arise.
track 11 ip sla 11 reachability
default-state up
delay down 90 up 120
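The ip sla 11 definition that track 11 references isn't shown; a companion sketch, with the probe target, interface name, and frequency as assumptions, might be:

```
! Sketch only: target, interface, and frequency are assumptions
ip sla 11
 icmp-echo 4.2.2.2 source-interface FastEthernet0
 frequency 30
ip sla schedule 11 life forever start-time now
```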
12-01-2010 12:02 PM
That's an interesting configuration. However, it still won't allow me to get two load-balanced, tracked default routes.