Solved: ISR G2 and GRE fragmentation/reassembly

pascalfr0 · 05-15-2014

Hi,

We plan to use GRE tunnels between the CPE (ISR G2 if we stick to Cisco routers) and the LNS (ASR1006, doing L2TP and GRE aggregation), running over PPP.

PPP MTU is 1500 bytes, and the GRE tunnel will set its MTU to 1476 bytes.

Subscriber links could range from 1 Mb/s SDSL lines to 16 Mb/s SDSL/EFM lines.

Using ip tcp adjust-mss on the tunnel interface will prevent IP fragmentation from happening for TCP traffic.
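
For reference, the MSS value that goes with a 1476-byte tunnel MTU can be worked out as below. This is only a rough sketch: it assumes plain IPv4 and TCP headers with no options and a basic 4-byte GRE header (no key or checksum), and the function names are made up for the illustration.

```python
# Header sizes in bytes; all are assumptions for the simplest case.
IPV4_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20    # TCP header without options
GRE_OVERHEAD = 24  # 20-byte outer IPv4 header + 4-byte basic GRE header


def tunnel_ip_mtu(transport_mtu: int) -> int:
    """Largest inner IP packet that fits in one GRE packet on the transport."""
    return transport_mtu - GRE_OVERHEAD


def mss_for_mtu(ip_mtu: int) -> int:
    """Largest TCP payload that fits in one IP packet of ip_mtu bytes."""
    return ip_mtu - IPV4_HEADER - TCP_HEADER


tunnel_mtu = tunnel_ip_mtu(1500)
print(tunnel_mtu, mss_for_mtu(tunnel_mtu))  # 1476 1436
```

With TCP options in use (timestamps, for instance) the usable payload per segment is a little smaller, but the MSS that is advertised or clamped is still computed this way.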

But we could still see IP fragmentation for non-TCP traffic (UDP, IPsec...) with packets > 1476 bytes.

For these fragmented datagrams, reassembly will be handled by the destination hosts.

We are investigating a solution where IP fragmentation/reassembly would be done only between the CPE and the LNS.

Usually, in the situation I have described above, the end-user IP datagrams entering the CPE from a LAN interface and sent through the GRE tunnel are fragmented, then the 2 resulting fragments are encapsulated into 2 GRE packets and sent toward the tunnel destination (the LNS). There, the 2 IP fragments are popped out of the GRE packets and sent toward their IP destination. The destination host has to reassemble the 2 fragments.

The idea would be to configure an IP MTU of 1500 at the GRE interface level, so that the end-user IP datagram will not be fragmented. The CPE will create a 1524-byte GRE packet and fragment that GRE packet (not the end-user datagram encapsulated within). The 2 fragments will be sent to the GRE tunnel destination (the ASR1006), and the ASR will reassemble the initial GRE packet and pop the end-user IP datagram from it.
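
To make the two cases concrete, here is a small sketch of the packet sizes involved for a 1500-byte end-user datagram. It is an illustration only, assuming 20-byte IPv4 headers without options and a basic 4-byte GRE header; the helper names are hypothetical and nothing here models actual router behaviour.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
GRE_HEADER = 4    # bytes, assuming basic GRE (no key, no checksum)


def fragment(total_len: int, mtu: int, header: int = IPV4_HEADER) -> list[int]:
    """Return the total length of each IPv4 fragment of a packet."""
    if total_len <= mtu:
        return [total_len]
    payload = total_len - header
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    chunk = (mtu - header) // 8 * 8
    sizes = []
    while payload > 0:
        take = min(chunk, payload)
        sizes.append(header + take)
        payload -= take
    return sizes


def encapsulate(inner_len: int) -> int:
    """Size of the GRE packet carrying an inner IP packet of inner_len bytes."""
    return inner_len + IPV4_HEADER + GRE_HEADER


inner = 1500
# Usual case: fragment the end-user datagram at tunnel IP MTU 1476,
# then encapsulate each fragment in its own GRE packet.
before_encap = [encapsulate(f) for f in fragment(inner, 1476)]
# Proposed case: encapsulate first (tunnel IP MTU 1500),
# then fragment the 1524-byte GRE packet at the 1500-byte PPP MTU.
after_encap = fragment(encapsulate(inner), 1500)

print(before_encap)  # [1500, 68]
print(after_encap)   # [1500, 44]
```

Either way two packets cross the PPP link; what changes is which device has to put them back together.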

=> the end-user systems won't see any fragmentation of their traffic,

=> most of the traffic is TCP and will never be fragmented thanks to adjust-mss, so this mechanism will only be triggered by non-TCP packets > 1476 bytes,

=> the CPE and LNS will have to handle IP GRE reassembly for non TCP traffic, for packets > 1476 bytes.

On the LNS side, this process is handled on the QFP (with hardware acceleration), and we may ask for a CPOC to check ASR performance with ESP40 and ESP100.

On the CPE side, it is more than likely done in process switching. Anyway, in the worst-case scenario, 16 Mb/s full duplex with 1500-byte packets needs only 2666 packets per second to fill the line both ways (1333 pps downstream, 1333 upstream).

Is 2666 pps (== 5333 fragments per second) something that an ISR G2 CPE (cisco898/lantic, c1941 and above) can handle without CPU exhaustion?
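
The arithmetic behind those figures, as a quick sketch. It assumes every packet is a full 1500-byte datagram that ends up as exactly two fragments, and it ignores GRE, PPP and layer 2 overhead on the line; the function name is made up for the example.

```python
def packets_per_second(rate_bps: float, packet_bytes: int) -> float:
    """Packets per second needed to fill a link with fixed-size packets."""
    return rate_bps / (packet_bytes * 8)


line_rate = 16_000_000                            # 16 Mb/s in each direction
per_direction = packets_per_second(line_rate, 1500)
both_ways = 2 * per_direction                     # full duplex: up + down
fragments = 2 * both_ways                         # each packet -> 2 fragments

# Truncated to whole packets, as in the figures quoted above.
print(int(per_direction), int(both_ways), int(fragments))  # 1333 2666 5333
```

Note that on the CPE only the downstream half has to be reassembled; the upstream half has to be fragmented.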

Joseph W. Doherty · 05-15-2014

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

See attachment for more detailed performance information on G2 ISRs.

pascalfr0 · 05-15-2014

I had this white paper in mind already, but can't find any figures relevant for my configuration in it.

In the Cisco quick reference for router performance, there's no clue about ISR G2 process switching performance either...

c898: +/- 90000 pps with CEF, nothing about process switching (the new 2014 898 models, designed to comply with RoHS, are said to be more powerful than pre-2014 models).

c1941: 290000 pps with CEF, nothing about process switching performance.

And I'm also wondering whether process switching pps figures would be accurate enough to work out the IP reassembly capacity of a small platform.

Let's say we have a platform with 11000 pps of process switching capacity (c2821). Can I safely guess that this platform will be able to handle the reassembly of the 2666 pps I need in my worst-case scenario?
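
Put as numbers, the guess looks like the sketch below. The big assumption, flagged here and in the code, is cost_factor: a purely hypothetical multiplier for how much more expensive handling a fragment is than process switching an ordinary packet. No measured value for it is available in this thread, so the output only shows how much margin the rated figure would leave.

```python
def cpu_budget(required_pps: float, rated_pps: float,
               cost_factor: float = 1.0) -> float:
    """Fraction of the rated process switching capacity that would be used.

    cost_factor is a hypothetical multiplier (not a measured value) for the
    extra cost of fragment handling compared to a plain switched packet.
    """
    return required_pps * cost_factor / rated_pps


def break_even_factor(required_pps: float, rated_pps: float) -> float:
    """cost_factor at which the rated capacity would be fully consumed."""
    return rated_pps / required_pps


rated = 11_000  # c2821 process switching figure quoted above

print(round(cpu_budget(2666, rated), 2))         # 0.24 (counting packets)
print(round(cpu_budget(5333, rated), 2))         # 0.48 (counting fragments)
print(round(break_even_factor(5333, rated), 2))  # 2.06
```

So if each fragment turned out to cost roughly twice a process-switched packet, the rated capacity would already be used up, which is why the raw pps figure alone does not settle the question.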

Joseph W. Doherty · 05-15-2014

What you're doing is somewhat unusual, so you probably won't find performance documentation for it.

Even if you had process switching performance values, I suspect fragmentation processing might be even worse.

About a year ago, I had a case of a pair of 2800s taking a huge jump in CPU usage. These routers were using GRE tunnels, and were configured with adjust-mss. However, the remote site added a few security cameras which sent their video via UDP, and as you noted, adjust-mss did not help those streams.

Our "cure" was to use jumbo Ethernet on the VPN backside, which avoided the need to fragment any 1477..1500 byte packets. CPU utilization dropped hugely for the same volume of traffic.

So, at least on the 2800 series, fragmentation was very CPU intensive. BTW, it didn't show as process CPU; it was part of interrupt CPU.

Unfortunately, we didn't bother trying to analyze how "costly" the fragmentation was relative to PPS, but for traffic before vs. after, with and without fragmentation, CPU hit was huge (something like 20% vs. 80%).

pascalfr0 · 05-15-2014

Very interesting feedback. Video is one of my main concerns (UDP, large packets...), with IPsec being the second one.

Do you remember what the traffic throughput (overall throughput and video throughput) and the CPE CPU load were in the case you describe here?

It seems odd that the router reported interrupt CPU and not process CPU... Did you check the switching mode for those packets at the tunnel interface or physical interface level (show interface switching, etc.)?

Joseph W. Doherty · 05-16-2014

I think the video consumption used about 5 Mbps.

Initially, the tunnels were VTI (IPSec), but later changed to GRE. This helped bring the CPU down, not because of the lack of encryption, but because larger video packets could be sent without the need to fragment.

I'm pretty sure the bulk of CPU was under interrupt, but didn't check the stats you're asking about. (NB: remember on ISRs, really everything is "process switched", but "interrupt" CPU is fast, very optimized, processing.)