778 Views, 5 Helpful, 8 Replies

Internet...a very messy place to be?

Ricky S
Level 3

Hi everyone, for the last few months I've been struggling with an issue. I have spent countless hours trying to find an answer.

I run a DMVPN network over the internet serving approx. 80 remote offices, including a data center. Most of the offices play nice and behave. However, I have a group of about 6-7 offices with this issue: users complain of poor application performance affecting everything they do.

Basically there is packet loss. Sometimes it can be very minimal and other times very high.

Below is a scenario

Office A External IP: aaa.aaa.aaa.aaa

Office B External IP: bbb.bbb.bbb.bbb

Office C External IP: ccc.ccc.ccc.ccc

Office D External IP: ddd.ddd.ddd.ddd


If I ping aaa.aaa.aaa.aaa with large 1500-byte packets over the internet (extended ping sourced from the outside physical interface) from Office B, I get a LOT of timeouts and packet loss.
But if I do the same thing, i.e. ping aaa.aaa.aaa.aaa over the internet from Office C, the pings go through just fine.

Office D behaves the same way, but then the same issue shows up again with Office E, and so on.

All my physical interface MTUs are manually configured at 1492 bytes (to avoid fragmentation due to IPsec), with the tunnel interfaces configured with a 1400-byte MTU.
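
For reference, a minimal sketch of that setup, assuming IOS-style syntax; the interface names below are just placeholders for my actual ones:

interface GigabitEthernet0/0
 ! internet-facing physical interface (placeholder name); depending on platform
 ! this may be set with the interface-level "mtu" command or with "ip mtu"
 ip mtu 1492
!
interface Tunnel0
 ! DMVPN mGRE tunnel interface (placeholder name)
 ip mtu 1400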


Chicago is one of my main offices and hosts a lot of application servers for many other offices. These other offices all connect to Chicago over DMVPN. Pinging Chicago from Columbus or Atlanta, I'll experience major packet loss, however NOT if I ping from Orlando or from as far away as Anchorage, etc.
We do have WAN accelerators in place behind the routers, but this test traffic only goes to the global (public) IPs of the routers, so the accelerators are not in the path.
I can start simultaneous extended pings of 1500 bytes from several offices to Chicago. From some I will see major packet loss, from others very little, and from others no loss whatsoever.
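
The test itself is basically a one-line extended ping like the following sketch (the address and source interface name are placeholders):

! run from a spoke router, sourced from its internet-facing interface
ping aaa.aaa.aaa.aaa size 1500 repeat 100 source GigabitEthernet0/0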

Is this an issue out of my control, i.e. on the ISP side? Is the internet really just not as well maintained as one would think, with some ISPs not playing nice with each other while others do?

If anyone can add their 2 cents, that would be much appreciated.


Thanks all in advance


8 Replies

Jeff Van Houten
Level 5

Peering agreements between carriers are not all the same, but it sounds like this is all US-based, so I doubt that is the root cause. I would bet that upstream from the sites where you are having problems, most likely not on the first hop, there is a connection having a problem. For the sites having problems, try a traceroute to pinpoint where the response times degrade and/or the failures begin. Then have the carriers look into any issues along the path.

Sent from Cisco Technical Support iPad App

colin.farley
Level 1

This is not uncommon in my experience. Since you have access to both ends it should be pretty easy to pinpoint the issue and forward to the closest ISP for investigation.

Sent from Cisco Technical Support Android App

edsge teenstra
Level 1

You write "Basically there is packet loss. Sometimes it can be very minimal and other times very high." When is it minimal and when is it high? Is there a pattern?

Are the WAN connections the same at the good and the bad locations?

Have you asked your ISP if they can measure the line?

Is there a location that has exactly the same router/WAN accelerator/ISP WAN interface/throughput/IOS as one of the bad locations?

A traceroute would be great to see where it happens. Also look in the show tech-support output at show processes cpu (sorted/history), or at the dropped-packet counters, to find more information (see the command sketch at the end of this post).

Have you had this issue from the start? If you restart the network equipment at one location, does it make things better for a while?

Does your Cisco router log any messages that tell you anything?
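
For example, something like this from a router at one of the bad locations (the destination address and interface name are placeholders):

traceroute bbb.bbb.bbb.bbb
show processes cpu sorted
show processes cpu history
show interfaces GigabitEthernet0/0
show tech-support

The interface counters (input errors, output drops) and the CPU history are usually the quickest things to check before digging through the full show tech-support.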

Hi Edsge,

Packet loss happens most of the time and seems mostly random. I tried to find a pattern but haven't been able to. Sometimes it happens when both circuits are very busy, and other times it happens when there is very minimal traffic at both sites.

The WAN connections are with different providers: both are 100 Meg pipes, one with Cogent and one with AT&T. I have spoken with Cogent but never called AT&T, which I am going to do tomorrow.

I have tried restarting the equipment, to no avail.

Traceroutes don't really show much, in that they take different paths from different locations. I don't see any drops or extremely high latency anywhere, although I do notice that for a given location the path changes between consecutive traceroutes. Maybe that's a clue that the ISP's network is not stable.

I am going to call the ISP tomorrow and see if they can see anything.

Thanks all for your suggestions.

Ricky S
Level 3

Hi everyone, after reading up on a ton of material about IP fragmentation issues with IPsec/GRE, I was able to resolve my issue by following the steps below:

- Set a max MTU on the outgoing physical interface (1492 bytes)

- Do NOT clear the DF bit at the router (I had a policy that was doing just that)

- Set the tunnel MTU to 1420 bytes

- Turn on tunnel path-mtu-discovery on my GRE tunnel interfaces to allow better coordination between GRE and IPsec on the router

The above allowed the end hosts to size their TCP segments based on the path MTU. It also took care of the excess fragmentation that was happening at the router.
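
In config terms, the end result looks roughly like this. It's a sketch with placeholder interface names, and it does not show the policy I removed that had been clearing the DF bit:

interface GigabitEthernet0/0
 ! internet-facing physical interface (placeholder name); on some platforms
 ! the "max MTU" is the interface-level "mtu" command rather than "ip mtu"
 ip mtu 1492
!
interface Tunnel0
 ! DMVPN mGRE tunnel interface (placeholder name)
 ip mtu 1420
 tunnel path-mtu-discovery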

Basically, if a router is fragmenting every 1500-byte packet it receives and then encapsulating and encrypting each fragment before shipping it over the internet, the far-end router has to collect all the fragments to reassemble the data packet. And if you lose even a single fragment along the path, the entire packet must be re-fragmented, re-encapsulated, re-encrypted and resent over the wire. This causes a lot of application delay and frustration for end users. Since DMVPN relies on the internet, and the internet is a very messy place, it's very common to lose a fragment every now and then.

There is a great Cisco article that I have printed out and that has become my bible. I take it everywhere I go.

http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml

Hopefully this will help resolve issues that others might be having with IP fragmentation in an IPsec/GRE world.

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

- Set a max MTU on the outgoing physical interface (1492 bytes)

For PPPoE, correct?

- Set the tunnel MTU to 1420 bytes

That's about right, but Cisco sometimes recommends setting it to 1400, just to ensure you don't accidentally go over (such as by not allowing for the extra 8-byte loss with PPPoE).

Some other hints:

If supported, the ip tcp adjust-mss command avoids the need for PMTUD to downsize the MTU (and to keep re-discovering it).

With p2p GRE/IPsec, switching IPsec from its default mode (I forget whether the default is transport or tunnel mode) will allow more bytes for payload.

Hi Joseph, the MTU of 1492 bytes is not just for PPPoE; I set it for the others also. In the past I've seen issues where an ISP, even though they say otherwise, sometimes won't allow a 1500-byte packet. So it has become second nature for me to change the MTU to 1492 bytes when it comes to VPN.

Actually, I was a bit torn on whether I should set the tunnel to 1400 bytes or take a chance with 1420. One of my central offices in the eastern United States has a VPN connection to another vendor through an ASA firewall. There are about 8 other offices that connect to this vendor through that central office. The vendor hosts an application server which everyone regularly needs access to. If it goes down for even one hour, my phone starts ringing off the hook.

I find that if I set the MTU to 1400 bytes on the tunnel, this web application stops working: the browser keeps spinning its wheel and never actually loads the app. However, if I set it to 1420, the application loads up with no issues.

I am still a bit on the fence about this for all the other offices, where I don't want the 1420 to cause any issues.

Should I set it to 1400 for other offices?

I am a bit confused about the behavior of ip tcp adjust-mss and am glad you brought it up. If I configure this on my router, does it mean that the source host and the router will negotiate the TCP packet size before the packet even makes it to the GRE process on the same router? Where would I configure this command? On the inside physical interface of the router?

I am not sure about the default, but I use IPSEC in transport mode since the secure connection is only required between routers. Tunnel mode adds extra bytes to the packet so I don't have that problem with transport mode.

Please advise.

Thank you so much for your help.

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

Hi Joseph, the MTU of 1492 bytes is not just for PPPoE; I set it for the others also. In the past I've seen issues where an ISP, even though they say otherwise, sometimes won't allow a 1500-byte packet. So it has become second nature for me to change the MTU to 1492 bytes when it comes to VPN.

It might be that you've bumped into a misconfigured (vendor) MPLS link, which will often take 4 bytes off the MTU.

Should I set it to 1400 for other offices?

If the 1420 is working for you without fragmentation, it does provide 20 more bytes of payload.  Again, the 1400 setting is just an additional allowance.

Something is wrong in the one case you describe where 1400 breaks your app.

I am a bit confused about the behavior of ip tcp adjust-mss and am glad you brought it up. If I configure this on my router, does it mean that the source host and the router will negotiate the TCP packet size before the packet even makes it to the GRE process on the same router? Where would I configure this command? On the inside physical interface of the router?

TCP MSS adjustment (ip tcp adjust-mss) works during the TCP session handshake/setup. You can configure it anywhere in-line where it will "see" the end-to-end setup; normally I set it on the tunnel interface. (Normally you set it to the MTU less 40.) Also, BTW, it works on TCP traffic flowing in either direction, i.e. you don't need to set it on more than one interface.
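
For example, on a tunnel with an ip mtu of 1420, a sketch would look like this (the interface name is a placeholder):

interface Tunnel0
 ip mtu 1420
 ! MSS = tunnel MTU minus 40 bytes (20 IP + 20 TCP)
 ip tcp adjust-mss 1380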

I am not sure about the default, but I use IPSEC in transport mode since the secure connection is only required between routers. Tunnel mode adds extra bytes to the packet so I don't have that problem with transport mode.

I just looked it up. Tunnel mode is the default, and it was transport mode I had in mind (which saves 20 bytes).
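
If you ever want to double-check it, the mode is set under the transform set, roughly like this (the names and ciphers here are just placeholders, not your actual config):

crypto ipsec transform-set DMVPN-TS esp-aes 256 esp-sha-hmac
 mode transport
!
crypto ipsec profile DMVPN-PROFILE
 set transform-set DMVPN-TS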
