I am having a strange problem. We have two sites connected by two 100 Mb/s lines.
In order to bypass complications raised by propagating our routes over MPLS, it was decided
to use two GRE tunnels to connect the two sites.
On both sides of the WAN links we have 1 Gb/s for all our workstations and servers, but the connecting lines run, as mentioned above, at 100 Mb/s each.
CPU and memory usage on the routers at each end of the two lines (two pairs of C2821s) are OK, both under 50%. The lines are not used at full capacity; we do see short traffic spikes, but nothing to worry us or to explain what I am going to describe below.
The problem we have is that our users are reporting, and I have tested this myself, dropped SSH or RDP sessions running across these two lines.
The GRE tunnel interfaces on each router show output queue drops, and these cannot be correlated with any traffic spikes or CPU load observed on these routers at the same time as the output drop rate measured via SNMP. Moreover, the drops seem to happen randomly...
To my knowledge, a router should show output queue drops only when it is dealing with congestion, and that should also be visible in the traffic values measured at the moment the output drops appear.
Has anybody experienced something like this? Does anybody have any idea what else could cause these drops? I should also mention that I am seeing output queue drops on the switches connected behind these 2821 routers (C3750 stacks at one end and a 4506 at the other end).
These could also be caused by workstations or servers trying to adjust their TCP window size, but that shouldn't cause SSH or RDP session drops.
Thanks in advance for any suggestions.
Did you check the reachability between the tunnel endpoints? An encapsulation failure could cause output queue drops on the tunnel interface.
A loss of reachability would affect other applications too, and would affect everybody at both ends.
We are using SNMP to monitor both ends and I couldn't see any problems.
So far we have reports just about RDP, SSH, and sometimes web applications, mail clients, or OCS.
The drops are reported while traffic is flowing fine...
I am using STG and PRTG to monitor live traffic values; no connectivity would translate into no traffic, so no, it's definitely not that.
Can you check show cef drop and look at the number of "Encap_fail"; is that number related to the number of drops on the tunnel interface?
I believe I did that too, and no, that was zero.
I will check again Monday.
However, the problem seems deeper than the dropped packets; that is the only strange thing I could see on my router. What bothers me is that the above-mentioned apps can't recover at all, at least SSH, which I examined closely.
Wireshark packet captures revealed lost TCP segments, which I believe is normal up to a degree.
Any dropped SSH session seems to lose the connection after a lost segment and a duplicate-ACK sequence; the latter seems to be the client trying to recover...
The most common reasons for output drops on a tunnel are MTU, CPU, and encapsulation failure. Since there is no encap_fail in the cef drop output, encapsulation might not be the reason here. Maybe you can turn on debug ip icmp and see if the router generates any ICMP type 3 code 4 messages.
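If you do turn that on, a minimal sketch of what to look for (the actual addresses will be whatever hosts triggered the message):

```
debug ip icmp
! reproduce the problem, then look in the log for lines roughly like:
!   ICMP: dst (a.b.c.d) frag. needed and DF set unreachable sent to w.x.y.z
! those are the type 3 code 4 messages
undebug all
```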
I was wrong about this.
I cleared the counters and apparently there is some correlation between the numbers:
#show cef drop
CEF Drop Statistics
Slot Encap_fail Unresolved Unsupported No_route No_adj ChkSum_Err
RP 203 0 0 0 0 0
#sh int | i drop
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 292
The MTU is 1476 for that interface. What would be the reason for this?
However, according to the definition:
Encap_fail indicates the number of packets dropped after exceeding the limit for packets punted to the processor due to missing adjacency information (CEF throttles packets passed up to the process level at a rate of one packet per second).
At this point I have reasons to believe you need to know more about this setup.
It is kind of weird, but it worked OK until we started having problems with these drops.
I have already recommended an L3 upgrade for CoreA and CoreB (see the attached picture). They are C3750 stacks running L2 images; they should be upgraded to L3 images, and that will stop the core traffic from hitting the edge routers.
I also know that there is plenty of multicast traffic in that client VLAN, and that may get the routers in trouble. Apparently the routers are quite busy dealing with interrupts...
show process cpu indicates
CPU utilization for five seconds: 32%/28%; one minute: 34%; five minutes: 33%
Ok, so the 292 is the number of drops on your tunnel interface?
Let's turn on debug ip cef drops to see what causes the encap_fail.
I wish I could do that... that is subject to an RFC and it will take time.
Is there any other way to debug this?
CEF is a Cisco thing; I don't know if there is an RFC for it.
The debug is not that CPU intensive, so you should not lock yourself out of the box; you can turn on "no logging console" and check the debug output from "show log" when you're done.
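Something along these lines (a sketch; the buffer size is an arbitrary example):

```
configure terminal
 no logging console        ! keep debug chatter off the console line
 logging buffered 64000    ! send it to the log buffer instead
 end
debug ip cef drops
! ... wait for some drops to accumulate ...
undebug all
show logging               ! review the captured debug messages
```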
Sorry for the delay and thanks for working with me on this.
RFC is, in this case, a Request For Change.
The environment is quite sensitive and they want everything approved before you touch the routers.
What I could do was to adjust the MTU on the Windows (XP) side.
I didn't look at the drops any more, but I can confirm that after decreasing the MTU to 1450 the SSH session that I started two days ago has worked fine; that never happened before.
If you could please look at the previous network diagram, I would like to ask you about the MTU settings we have here.
The GRE tunnel interface was configured with an MTU of 1472 B, which I believe was calculated this way:
  1500 B  default MTU
−   24 B  GRE overhead
−    4 B  802.1Q overhead
I know that this is tricky and the terminology is not used uniformly on the workstation side or on the router side:
the Ethernet frame is 1518 B (1500 B payload + 18 B of header and FCS)
the tagged Ethernet frame is 1522 B (1500 B payload + 18 B of header and FCS + 4 B tagging overhead)
Adding GRE should subtract 24 B from the payload, which gives MTU = 1500 − 24 = 1476 B.
I have no idea how the person who configured the router came up with this 1472 B.
I can only guess that he calculated it as MTU = 1500 B − 24 B − 4 B = 1472 B,
but in that case subtracting 4 B from the MTU makes no sense; the output interface is tagged, but the tag is added after the MTU has been checked against the maximum configured value.
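One way to sanity-check the usable size end to end, independent of how the configured value was derived, is an extended ping across the tunnel with the DF bit set (the peer address below is hypothetical):

```
! if the tunnel IP MTU of 1476 B is right, the first ping should succeed
! and the second should fail (packet too big with DF set):
ping 192.0.2.2 size 1476 df-bit
ping 192.0.2.2 size 1477 df-bit
```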
Another thing that I don't understand is what really fixed the problem. I changed the MTU on the workstation, decreasing it to an arbitrary 1450 B in order to make sure the packets would pass through the router no matter what.
I believe that if I had changed it to 1472 B it would have worked too. My guess is that we have a mismatch between the Windows XP IP MTU and the GRE tunnel MTU, no matter how the GRE IP MTU was calculated.
Glad to hear you fixed the problem.
I agree with you, there is no need to subtract another 4 bytes on the GRE tunnel, unless you are using a Metro-E circuit and the SP uses a 1500 MTU.
I think the problem here was that some applications send out large packets with the DF bit set, so the tunnel has to drop them; changing to 1472 B should work as well.
A few more questions on this, if you don't mind. Actually, it is more me trying to confirm my understanding of the situation I have here.
To recap we have these facts:
What seems to have improved things was adjusting the MTU size for the workstations that were experiencing problems: MTU = 1450, which is below the 1472 value configured for the tunnel.
Connecting the dots, my conclusion is that the router is receiving packets larger than 1472 B.
That forces the router to bypass CEF and fragment the packets, which explains point 4.
Having to fragment too many packets, the router has to punt them, and because the punt rate is throttled (one packet per second, per the definition quoted earlier), the router has to drop packets.
Is this correct?
A better fix for this seems to be using ip tcp adjust-mss, but that will fix only TCP sessions.
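As a sketch, with the MSS value derived from the 1476 B tunnel MTU (1476 − 20 B IP − 20 B TCP = 1436):

```
interface Tunnel0
 ip mtu 1476
 ip tcp adjust-mss 1436   ! clamps the MSS advertised in TCP SYNs
```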
Another fix that was suggested to me was to use tunnel path-mtu-discovery. In my opinion this should be used if we have problems with establishing the tunnel and transmitting packets through the tunnel.
As I described above, I believe we are having problems with the packets being fragmented, not tunneled.
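For reference, enabling it would be just (a sketch):

```
interface Tunnel0
 tunnel path-mtu-discovery   ! copies the DF bit into the GRE delivery
                             ! header and lets the tunnel MTU adapt to
                             ! ICMP unreachables from the transport path
```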
I am not sure whether the dropped packets should show up as output drops or not. Can you please confirm?
I cannot agree with you on point 4. Interrupt processing means traffic is being CEF or fast switched; the percentage of interrupt processing will not show up as a process in the show proc cpu output, because the process list only shows scheduled CPU usage. If traffic is process switched, it is handled by the IP Input process, and you will see that in the show proc cpu output. A fragmented packet will be process switched, but it should not be dropped unless the CPU is too high or the process-switched packet rate exceeds the platform's capacity.
For most new IOS versions, tunnel interfaces use CEF switching by default; if a packet cannot be CEF switched, for example a fragmented packet, it will be process switched. The process-switching rate differs per platform.
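You can confirm which path the tunnel traffic actually takes from the per-path counters (output layout varies by IOS version):

```
show interfaces Tunnel0 stats
!   Switching path    Pkts In   Chars In   Pkts Out  Chars Out
!        Processor        ...        ...        ...        ...
!      Route cache        ...        ...        ...        ...
!            Total        ...        ...        ...        ...
```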
If the packet has the DF bit set, fragmentation will cause a packet drop; otherwise it should be process switched, not dropped. If a regular packet without the DF bit set got dropped because of fragmentation, that is not normal behavior.
OMG, you are right. I have been speculating all this time based on the assumption that the ratio between total CPU usage and CPU interrupt usage indicated poor performance, while it was exactly the opposite.
I had to refresh my knowledge about switching paths to understand that, unlike on old computers, a router spending its CPU time in interrupts is actually OK; it means CEF (in this instance).
Then I am confused; I can't understand why the router shows output drops for that tunnel interface while the real-time traffic stats show no spikes or high usage on the connected line.
The dropped-session issue was bypassed by modifying the MTU size at the client end, but I still have no explanation for what is wrong, as the MTU settings for that tunnel interface seem to be calculated correctly.
As someone suggested to me, it might be an MTU size problem on the path between the two tunnel endpoints,
but that shouldn't result in the output drops that we are seeing, should it?
I am afraid I will have to push for that debug ip cef drops you recommended.
BTW, do you work for TAC? I may need to open a ticket at this point.
Yes, turning on the debug can give you a better view of which packets are dropped by CEF.
Unfortunately, I do not work in TAC.
The number is still small... do you think that I might lose the router while enabling that?
I am not sure if I have physical access to this ....
You won't be able to change the value unless you have the Advanced IP Services feature set.