Intermittent tunnel issues

On one of our IOS-to-IOS tunnels, ISAKMP seems to get corrupted from time to time, causing the tunnel to shut itself down.

Relevant excerpt from the log (debug crypto isakmp):

017975: Oct 25 10:03:32.702 CET: %CRYPTO-6-ISAKMP_ON_OFF: ISAKMP is ON

017976: Oct 25 10:03:34.698 CET: %LINK-3-UPDOWN: Interface Tunnel0, changed state to up

017977: Oct 25 10:03:47.110 CET: ISAKMP:(0):: Profile dmvpn_spokes_isakmp discarded due to VRF mismatch * * *

017978: Oct 25 10:03:47.110 CET: ISAKMP:(0):: Have you put proper FVRF in "match id ip-address" command?

017979: Oct 25 10:03:47.110 CET: ISAKMP:(0):deleting SA reason "Receive initial contact" state (I) MM_NO_STATE (peer [REMOTE_IP])

017980: Oct 25 10:03:47.110 CET: ISAKMP:(0):deleting SA reason "Receive initial contact" state (I) MM_NO_STATE (peer [REMOTE_IP])

017981: Oct 25 10:03:47.138 CET: ISAKMP:(1041): IPSec policy invalidated proposal with error 32

017982: Oct 25 10:03:47.138 CET: ISAKMP:(1041): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])

017983: Oct 25 10:03:47.138 CET: ISAKMP:(1041):deleting node 570435705 error TRUE reason "QM rejected"

017984: Oct 25 10:04:02.726 CET: ISAKMP:(1041): IPSec policy invalidated proposal with error 32

017985: Oct 25 10:04:02.726 CET: ISAKMP:(1041): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])

Now this is definitely strange, since _no_ configuration has been changed on either side.

Even stranger - shutting, clearing, and reinitializing right away just gives the same error. However, if we keep the tunnel interfaces shut for a while (say ~5 minutes), the tunnel comes back up.

Could this be due to something being cached externally somewhere along the route?

Is there perhaps a timeout on either of the routers that can be reset that I'm not aware of?

An IOS bug (v12.4(1r))?


Marcin Latosiewicz
Cisco Employee

Jorn,


Maybe the problem is as "easy" as who is initiating the tunnel?

I don't remember any bug that would be specific to this release, but going for something more recent would be a decent first step.

Can you attach the configs? "sh run | s crypto" and "sh run int 0" would be a minimum.

Marcin

Sorry about not attaching the configs earlier.

In short: RouterA uses VRFs, RouterB does not (hence the configs differ).

The log in the first post was from RouterA.

-------------

RouterA config:

-------------

crypto keyring dmvpn_spokes_keys vrf hmvnett
  description DMVPN SPOKE TUNNEL KEYS
  pre-shared-key address [RouterB_IP] key ******
crypto isakmp policy 1
encr aes
authentication pre-share
crypto isakmp keepalive 20
crypto isakmp profile dmvpn_spokes_isakmp
   vrf hmvnett
   keyring dmvpn_spokes_keys
   match identity address [REMOTE_IP] 255.255.255.255
crypto ipsec transform-set strong esp-3des esp-sha-hmac
crypto ipsec profile dmvpn_hub
set security-association lifetime seconds 120
set transform-set strong
set isakmp-profile dmvpn_spokes_isakmp
!
interface Tunnel0
bandwidth 8000
ip vrf forwarding hmvvpn
ip address 10.47.2.9 255.255.255.252
no ip redirects
ip mtu 1460
ip tcp adjust-mss 1300
tunnel source GigabitEthernet0/1.50
tunnel destination [RouterB_IP]
tunnel mode ipsec ipv4
tunnel vrf hmvnett
tunnel protection ipsec profile dmvpn_hub
end
!
interface GigabitEthernet0/1.50
encapsulation dot1Q 50
ip vrf forwarding hmvnett
ip address [RouterA_IP] 255.255.255.240
ip nat outside
ip virtual-reassembly max-reassemblies 32
no ip mroute-cache
no snmp trap link-status

-------------

RouterB config:

-------------

crypto isakmp policy 1
encr aes
authentication pre-share
crypto isakmp key ****** address [RouterA_IP]
crypto isakmp keepalive 20
crypto ipsec transform-set strong esp-3des esp-sha-hmac
crypto ipsec profile dmvpn_spoke
set security-association lifetime seconds 120
set transform-set strong
!
interface Tunnel0
bandwidth 8000
ip address 10.47.2.10 255.255.255.252
no ip redirects
ip mtu 1460
ip tcp adjust-mss 1300
tunnel source GigabitEthernet0/0
tunnel mode ipsec ipv4
tunnel destination [RouterA_IP]
tunnel protection ipsec profile dmvpn_spoke
!
interface GigabitEthernet0/0
ip address [RouterB_IP] 255.255.255.252
ip nat outside
ip virtual-reassembly
duplex auto
speed auto

Jorn,

Odd... you're using an FVRF:

interface Tunnel0
bandwidth 8000
ip vrf forwarding hmvvpn
ip address 10.47.2.9 255.255.255.252
no ip redirects
ip mtu 1460
ip tcp adjust-mss 1300
tunnel source GigabitEthernet0/1.50
tunnel destination [RouterB_IP]
tunnel mode ipsec ipv4
tunnel vrf hmvnett
tunnel protection ipsec profile dmvpn_hub
end

And your isakmp profile does not reference that FVRF in its match identity statement:

match identity address [REMOTE_IP] 255.255.255.255

I believe this should be corrected to:

match identity address [REMOTE_IP] 255.255.255.255 hmvnett

From my lab:

BB_966(config)#crypto isakmp profile TEST_PROFILE
% A profile is deemed incomplete until it has match identity statements
BB_966(conf-isa-prof)#match identity add 1.2.3.4 ?
  A.B.C.D  specify mask
  WORD     Specify the F VRF (default global)
 

BB_966(conf-isa-prof)#match identity add 1.2.3.4

Could very much be a misconfiguration. I've been following examples from the web, but not many of them cover setups with VRFs.

Strange, though, that things have been working at all? Or is there some implicit "natural" explanation for this?

Tried deploying the changes you suggested, and things seem to work fine. Will leave the config as it is for now and see if this has been causing the trouble. Will keep you posted if there are any oddities.

Thanks for helping!!

Jorn,

Sure thing, check it out.

My take on this is that you were seeing different behavior depending who was the initiator ;-)

It's been a while since I last played with a VTI setup; if I find some time I'll check it out, but I don't have any setup I could adapt right now.

Marcin

Indeed, explicitly adding the FVRF reference to the isakmp profile took care of that error message, just as you stated.

However, the drops are still there. The tunnel went down again last night, logs reading (with isakmp error debugging):

--------
RouterA
--------
046177: Nov  4 08:11:38.105 CET: ISAKMP:(1067): IPSec policy invalidated proposal with error 32
046178: Nov  4 08:11:38.109 CET: ISAKMP:(1067): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046179: Nov  4 08:12:08.106 CET: ISAKMP:(1067): IPSec policy invalidated proposal with error 32
046180: Nov  4 08:12:08.106 CET: ISAKMP:(1067): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046181: Nov  4 08:12:08.110 CET: ISAKMP:(1067):deleting node -960883661 error TRUE reason "QM rejected"
046182: Nov  4 08:12:38.106 CET: ISAKMP:(1067): IPSec policy invalidated proposal with error 32
046183: Nov  4 08:12:38.110 CET: ISAKMP:(1067): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046184: Nov  4 08:12:38.110 CET: ISAKMP:(1067):deleting node 180877383 error TRUE reason "QM rejected"
046185: Nov  4 08:12:46.238 CET: ISAKMP:(1067): IPSec policy invalidated proposal with error 32
046186: Nov  4 08:12:46.238 CET: ISAKMP:(1067): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046187: Nov  4 08:13:16.235 CET: ISAKMP:(1067): IPSec policy invalidated proposal with error 32
046188: Nov  4 08:13:16.235 CET: ISAKMP:(1067): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046189: Nov  4 08:13:46.235 CET: ISAKMP:(1067): IPSec policy invalidated proposal with error 32
046190: Nov  4 08:13:46.235 CET: ISAKMP:(1067): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046191: Nov  4 08:13:46.239 CET: ISAKMP:(1067):deleting node 1006172280 error TRUE reason "QM rejected"
046192: Nov  4 08:14:16.316 CET: ISAKMP:(1068):Unable to copy name into saved_grpname
046193: Nov  4 08:14:16.340 CET: ISAKMP:(1068): IPSec policy invalidated proposal with error 32
046194: Nov  4 08:14:16.340 CET: ISAKMP:(1068): phase 2 SA policy not acceptable! (local [LOCAL_IP] remote [REMOTE_IP])
046195: Nov  4 08:14:16.340 CET: ISAKMP:(1068):deleting node 1429853498 error TRUE reason "QM rejected"
046196: Nov  4 08:14:41.008 CET: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel0, changed state to up

--------
RouterB
--------
Nov  4 08:06:06.799 CET: ISAKMP:(1014):deleting node -407957394 error TRUE reason "Delete Larval"
Nov  4 08:06:36.803 CET: ISAKMP:(1014):deleting node 542855624 error TRUE reason "Delete Larval"
Nov  4 08:06:36.807 CET: ISAKMP:(1014):deleting node -1765312807 error TRUE reason "Delete Larval"
Nov  4 08:07:06.803 CET: ISAKMP:(1014):deleting node 1898221586 error TRUE reason "Delete Larval"
Nov  4 08:07:36.808 CET: ISAKMP:(1014):deleting node 852202429 error TRUE reason "Delete Larval"
Nov  4 08:07:49.308 CET: ISAKMP:(1014):deleting node 1987274176 error TRUE reason "Delete Larval"
Nov  4 08:08:19.304 CET: ISAKMP:(1014):deleting node -1373640289 error TRUE reason "Delete Larval"
Nov  4 08:08:49.304 CET: ISAKMP:(1014):deleting node 1216149038 error TRUE reason "Delete Larval"
Nov  4 08:09:19.308 CET: ISAKMP:(1014):deleting node -695960059 error TRUE reason "Delete Larval"
Nov  4 08:09:30.684 CET: ISAKMP:(1014):deleting node 395184133 error TRUE reason "Delete Larval"
Nov  4 08:10:00.684 CET: ISAKMP:(1014):deleting node 327459661 error TRUE reason "Delete Larval"
Nov  4 08:10:00.688 CET: ISAKMP:(1014):deleting node 440968144 error TRUE reason "Delete Larval"
Nov  4 08:10:30.785 CET: ISAKMP:(1014):deleting node 131026033 error TRUE reason "Delete Larval"
Nov  4 08:11:00.685 CET: ISAKMP:(1014):deleting node -1683986982 error TRUE reason "Delete Larval"
Nov  4 08:11:08.117 CET: ISAKMP:(1014):deleting node 393184170 error TRUE reason "Delete Larval"
Nov  4 08:11:38.121 CET: ISAKMP:(1014):deleting node -1946259847 error TRUE reason "Delete Larval"
Nov  4 08:12:08.117 CET: ISAKMP:(1014):deleting node -960883661 error TRUE reason "Delete Larval"
Nov  4 08:12:38.121 CET: ISAKMP:(1014):deleting node 180877383 error TRUE reason "Delete Larval"
Nov  4 08:12:46.249 CET: ISAKMP:(1014):deleting node -929479666 error TRUE reason "Delete Larval"
Nov  4 08:13:16.245 CET: ISAKMP:(1014):deleting node 626584940 error TRUE reason "Delete Larval"
Nov  4 08:13:46.250 CET: ISAKMP:(1014):deleting node 1006172280 error TRUE reason "Delete Larval"
Nov  4 08:14:07.726 CET: ISAKMP:(0):Can't decrement IKE Call Admission Control stat outgoing_active since it's already 0.
Nov  4 08:14:16.350 CET: ISAKMP:(1015):deleting node 1429853498 error TRUE reason "Delete Larval"
Nov  4 08:14:34.418 CET: ISAKMP:(0):Can't decrement IKE Call Admission Control stat outgoing_active since it's already 0.
Nov  4 08:14:41.434 CET: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel0, changed state to up

This morning I tried the following:

  clear cry sa + clear cry sess => no luck

  clear cry dh + clear cry sess => success

Any suggestions on this ?

I've also added periodic DPDs and increased the anti-replay window size to 128 since this thread started, hoping these would prevent the issue or at least act as countermeasures. They clearly didn't.
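For reference, those two changes presumably correspond to global configuration along these lines - a sketch, since the exact commands weren't posted in the thread:

```
! Assumed commands (not posted in the thread):
! switch DPDs from on-demand to periodic probing every 20 seconds,
! and widen the IPsec anti-replay window to 128 packets
crypto isakmp keepalive 20 periodic
crypto ipsec security-association replay window-size 128
```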

Jorn,

Periodic DPDs are only a way to poll whether IKE is up, while the messages you're seeing relate to (for some reason) not being able to agree on proposals for phase 2 IPsec; phase 1, which the DPDs use, is already up.

We can certainly alleviate the situation by setting the phase 2 lifetime to something other than the minimum - if you want security, add PFS instead of using a very short phase 2 lifetime.

A longer phase 2 lifetime means fewer renegotiations of phase 2. Would you mind setting the phase 2 lifetime back to the default (1 hour) and, if you want increased security, adding PFS?

PFS and lifetime settings need to match on both sides.
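On RouterA, that suggestion would translate to something like the following sketch (the PFS group here is an assumption for illustration; the same lifetime and PFS settings would need to be mirrored on RouterB's dmvpn_spoke profile):

```
crypto ipsec profile dmvpn_hub
 ! revert to the default 3600-second phase 2 lifetime
 no set security-association lifetime seconds 120
 ! optionally add PFS (the DH group is a choice and must match on both sides)
 set pfs group2
```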

All in all, we'd need full debugs from both sides while the issue starts:

debug cry isa

deb crypto ipsec

deb cry kmi

What it looks like to me (and this is just me throwing some thoughts in the air, not any factual data, since data is scarce) is either a memory leak (which would explain why you also had to clear DH... but I have my doubts) or you're stressing the control plane with frequent rekeys, which leads to some sort of de-synchronization scenario...

Would it be possible for you to apply PFS and set the lifetimes back to their default values?

Marcin

Will definitely try what you are suggesting here.

Changes will be deployed when production ends today.

Will dump logs as soon as they contain anything relevant (e.g. when the tunnel goes down) - or would you rather I supply logs from normal operation as well?

Thanks again btw.

Jorn,

The debugs - you can enable them freely when you see the issue.

Ideally they should be enabled on both sides just before the issue starts, but since we don't have the luxury of knowing this, enabling them during the issue should be fine.

FYI, if too many debugs are popping up, you can limit them to a single peer by using "debug crypto condition peer ipv4 ..."
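For example, on RouterA the conditional filter plus the debugs would look roughly like this ([RouterB_IP] stands in for the actual remote endpoint, as elsewhere in this thread):

```
! restrict crypto debug output to a single peer, then enable the debugs
debug crypto condition peer ipv4 [RouterB_IP]
debug crypto isakmp
debug crypto ipsec
```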

Marcin

Ok, the issue came back some time this afternoon. Hard to say exactly when, since all that debug output floods the logs and unfortunately we still don't have a log server up yet.

Attached log from both sides.

Jorn,

To me it looks like a problem with isakmp profile.

I'm not sure you need to set the iVRF on the profile, since you're already doing the iVRF on the tunnel interface. I would need to run a lab test - this looks pretty easy to reproduce.

I will be away for the next 10 days, so what I would like to suggest is to open a service request and reference this thread.

Marcin

Just wanted to give you a follow-up on this.

I tried passing this on to Cisco support, but before it even got that far I decided to try your final suggestion.

In fact, removing the iVRF reference from the profile worked. It's been stable ever since!
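For the record, based on the config posted earlier, the fix presumably amounted to removing the vrf statement from the ISAKMP profile on RouterA - a sketch, since the exact change wasn't posted:

```
! assumed change on RouterA: drop the iVRF reference from the profile;
! the tunnel interface already carries the iVRF via "ip vrf forwarding"
crypto isakmp profile dmvpn_spokes_isakmp
 no vrf hmvnett
```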

Thumbs up and a big smile to you