MLPPP with high output drops

m.coakley
Level 1

I have a customer who has had a recent drop in performance on their WAN network between two locations. There are four T1s bundled together using MLPPP. The lines carry at least 100 VoIP sessions using G.729a as well as telnet sessions, a dialer program, Windows traffic, Exchange traffic and Internet traffic. There is NO QoS on the lines. (Yes, we are new to this customer and are working through the issues left to us.)

The main customer complaint is that the dialer application on the remote side is not working. The people supporting the dialer application say that it is the network, and I need to either fix the problem or assure the customer the network is working as expected.

When I look at the multilink interface, it has a tremendous number of output drops on the "hub" side. The remote side has some output drops, but it is nothing compared to the hub side.

Here is the Multilink config and show interface.

interface Multilink1
 ip address 192.168.X.X 255.255.255.252
 load-interval 30
 ppp multilink
 ppp multilink fragment disable
 ppp multilink group 1
!
interface Serial0/0
 no ip address
 encapsulation ppp
 load-interval 30
 ppp multilink
 ppp multilink group 1
!
interface Serial0/1
 no ip address
 encapsulation ppp
 load-interval 30
 ppp multilink
 ppp multilink group 1
!
interface Serial1/0
 no ip address
 encapsulation ppp
 load-interval 30
 ppp multilink
 ppp multilink group 1
!
interface Serial1/1
 no ip address
 encapsulation ppp
 ppp multilink
 ppp multilink group 1

Here is the output from the "show interfaces multilink1" (s int m1) command:

Multilink1 is up, line protocol is up
  Hardware is multilink group interface
  Internet address is 192.168.X.X/30
  MTU 1500 bytes, BW 6176 Kbit, DLY 100000 usec,
     reliability 255/255, txload 86/255, rxload 54/255
  Encapsulation PPP, LCP Open, multilink Open
  Open: CDPCP, IPCP, loopback not set
  Keepalive set (10 sec)
  DTR is pulsed for 2 seconds on reset
  Last input 00:00:06, output never, output hang never
  Last clearing of "show interface" counters 01:47:51
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 1322003
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  30 second input rate 1326000 bits/sec, 2599 packets/sec
  30 second output rate 2106000 bits/sec, 2694 packets/sec
     17180587 packets input, 1103964228 bytes, 0 no buffer
     Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
     17734499 packets output, 1612610178 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out
     0 carrier transitions

As you can see, the "Total output drops" counter works out to roughly 204 packets per second dropped since the counters were last cleared.
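For reference, that figure is just the drop counter divided by the time since the last clearing of counters:

1,322,003 drops / 6,471 seconds (01:47:51) ≈ 204 drops per second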

Here is the output from "show ppp multilink interface multilink1" (s ppp mul int m1):

Multilink1, bundle name is ROUTER-TEST
  Bundle up for 04:10:43, 91/255 load
  Receive buffer limit 48768 bytes, frag timeout 1000 ms
  29/2018 fragments/bytes in reassembly list
  336 lost fragments, 13101758 reordered
  0/0 discarded fragments/bytes, 0 lost received
  0x1880E7 received sequence, 0x29944A sent sequence
  Member links: 4 active, 0 inactive (max not set, min not set)
    Se0/0, since 04:10:43
    Se0/1, since 04:10:42
    Se1/1, since 04:10:37
    Se1/0, since 02:06:38

Thanks,

Mike

12 Replies

a-vazquez
Level 6

Output drops are the result of a congested interface (that is, the outgoing interface cannot transmit all of the packets that need to be sent out at the rate they arrive). The ultimate solution to the problem is to increase the line speed.

However, there are ways to prevent, decrease, or control output drops without increasing the line speed. Preventing output drops is only possible if output drops are a consequence of short bursts of data. If output drops are caused by a constant high-rate flow, drops can't be prevented; they can only be controlled.

If increasing the bandwidth is not a near-term option for you, it is better to use congestion management, which ensures, with appropriate configuration, that important packets are always forwarded while less important packets are dropped when the link is congested. Congestion management comprises the fancy queueing mechanisms (see the sketch after this list):

priority queueing

custom queueing

class-based weighted fair queueing
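As a purely illustrative example of the third option, a minimal class-based weighted fair queueing policy with a low-latency queue for voice might look like the following. The VOICE/WAN-OUT names, the DSCP EF match, and the 2600 kbps priority figure are assumptions for the sketch, not values taken from this network:

! match voice on DSCP EF (assumes the phones mark their RTP packets that way)
class-map match-any VOICE
 match ip dscp ef
!
! give voice a strict-priority (LLQ) queue; everything else shares weighted fair queueing
policy-map WAN-OUT
 class VOICE
  priority 2600
 class class-default
  fair-queue
!
interface Multilink1
 service-policy output WAN-OUT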

Output drop prevention should never be attempted by simply increasing the output queue depth. If packets stay too long in the output queue, TCP timers might expire and trigger retransmissions, and the retransmitted packets only congest the outgoing interface even more.

I agree with what you are saying, BUT I did not feel this was the case in this situation. (Not having a ton of MLPPP experience, I asked the question of the forum.) When I look at the serial line usage of the circuits in the ML bundle, they are using only about half of their bandwidth. When I look at the CPU load, it is running at a constant 60% - 85%. The output drops do seem to come in bursts, but there are no other indicators that the lines or the system are congested. The only interface with output drops is the MLPPP bundle. I'm concerned that if I implemented some type of QoS on the ML interface I might burn up the remaining CPU that I have, making the situation worse. So given the above information, do you still believe that it is a congestion problem?

The symptoms you describe are usually the result of a FIFO queue and a device (or devices) that periodically generates excessive traffic, causing congestion. You need to identify the offenders and deal with them accordingly. If the offenders are not configurable, then QoS is the next option. Use the command "show queue multilink 1" during periods of dropping (if possible) and watch the depth "(depth/weight/total drops/no-buffer drops/interleaves)" in the conversation queues and the hosts involved. The best results can be obtained by going to weighted fair queueing on the multilink interface, because the offenders will then show up as separate conversations and the depth of the offenders' conversations will exceed the rest by an identifiable margin. A minimal sketch of that change is below.
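Assuming the bundle is still running the default FIFO queue shown in the original post, the change and the monitoring commands would look roughly like this (flow-queue parameters left at their defaults):

interface Multilink1
 fair-queue
!
! then, while the drops are occurring:
show queue multilink 1
show interfaces multilink 1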

Good luck,

Brian

Brian,

Thanks for your input and useful description. I will try this out and report back here my results.

But bottom line, output drops on any interface are bad. I was under the impression that they weren't that bad in this situation, because the problem I am having only appears to affect one protocol for a proprietary system. The SIP VoIP calls, which make up the majority of the traffic, are unaffected.

Thanks,

Mike

Agreed, output drops are bad on any interface when they are tail drops and you aren't sure which packets are actually being dropped when congestion occurs. The SIP VoIP calls may or may not be affected; people may simply not report the problems. I have learned over many years that even if you don't hear complaints, there may be problems anyway.

Brian

Brian,

I turned on WFQ and started to see the packets that were dropped in the different sessions. I saw, in multiple sessions, packet drops across a variety of protocols. I too have experienced the "non-complaint" issue that you talked about, but I was getting my information from the IT staff, the users, and the Executive VPs, who swore to me that the VoIP quality was good. Anyway... once confronted with the evidence that all protocols were having the issue, the users (all of whom were previously interviewed) relented and stated that they had issues with all protocols. So, we are sending them a new router today to see if it is a CPU issue. The statistics don't really indicate this, and given that a 2651XM can handle about 40,000 pps and our heaviest traffic, VoIP, at its heaviest point will generate about 5,000 pps, I don't think we are killing the router from a packet perspective.
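As a rough sanity check on that 5,000 pps figure, assuming the usual 20 ms packetization for G.729a (50 packets per second per call, per direction):

100 calls x 50 pps ≈ 5,000 pps in each direction across the bundle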

Thanks for all your input. I will update the ticket with the results of the new router.

Mike

The issue may not be CPU related but actually traffic-load related (I didn't see evidence of CEF being enabled). The CPU utilization does indicate the need to explore a faster router, given the need to apply quality of service here. Profiling the traffic and deciding which protocols can tolerate drops using WRED will most likely be the move after the CPU suspicion is confirmed. The most important things to consider: VoIP doesn't tolerate drops and should be using LLQ in a QoS policy; transaction-based applications (SQL, Oracle) don't tolerate drops well and should be the next concern; but many TCP-based applications tolerate WRED-based drops quite well and will actually adjust their behavior, thanks to TCP's design of throttling back when drops are detected. A rough sketch of what I mean is below.
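For illustration only, and reusing the hypothetical WAN-OUT policy sketched earlier in the thread, checking CEF and letting the best-effort class take WRED-controlled drops instead of tail drops would look roughly like this:

! from exec mode, verify CEF:
show ip cef summary
!
! from global configuration, enable it if it is off:
ip cef
!
! and let the default class drop with WRED instead of tail drop:
policy-map WAN-OUT
 class class-default
  fair-queue
  random-detect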

Brian

Brian et al,

We put in a new router, a 3640 (which basically doubles the performance of the 2651XM), and the problem got worse. Not sure why. I finally got a Cisco TAC case open (long story why we couldn't open one), and the Cisco engineer suggested the following; a rough sketch of the resulting interface config follows the list:

1. Replace 3 of the WIC-1DSU-T1 cards.

2. Enter the "no ip route-cache" and "no ip mroute-cache" statements on the serial and multilink interfaces.

3. If the problem still persisted, enter the "hold-queue 400 out" and "hold-queue 200 in" statements on the serial interfaces.
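Applied to one member link (and repeated on the other serials, with step 2 also on Multilink1), those suggestions amount to roughly this:

interface Serial0/0
 no ip route-cache
 no ip mroute-cache
 hold-queue 400 out
 hold-queue 200 in
!
interface Multilink1
 no ip route-cache
 no ip mroute-cache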

We performed all of the above steps, and the problem got worse still. I'm not sure why Cisco recommended the above, but I'm just doing as they say to get to the next step.

If anyone can think of a potential cause of this problem, please chime in; none of the solutions presented so far has solved it. Some have made it better or have given me better insight into the problem, but there is no solution as of yet.

I should also add that I have performed a 2-hour data capture using Ethereal on a SPAN port carrying all traffic on the WAN gateway's FastEthernet interface. I'm not seeing too much there. I do have TCP errors/warnings for about 0.3% of the traffic, which we are looking into, but this doesn't account for the 150-180 pps of output drops we are seeing.

Thanks,

Mike

Everyone,

Finally, it appears we have a "band-aid" on the problem. We tuned the output queues (using the hold-queue statements) on the serial and multilink interfaces. The final config had the output queues on the serial interfaces at 200 and the output queue on the multilink at 800; the relevant lines are sketched below. Now, under full load, the drops are at about 8 pps. Given the load on the router, this seems to be about the best we can do, and it is working for my customer.
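For anyone following along, the queue-depth change amounts to this (shown for one serial member; the same hold-queue line went on each of the member serials):

interface Multilink1
 hold-queue 800 out
!
interface Serial0/0
 hold-queue 200 out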

Thanks,

Mike

There should be very serious research put into using QoS to resolve the root problem. The deep queues will result in packets suffering delay, and the traffic that doesn't tolerate delay or drops (VoIP, SQL, Oracle) will suffer, causing degradation in those protocols. We can safely assume the drops are still tail drops, because there doesn't appear to be WRED or WFQ employed. Investigate which conversations are resulting in deeper queues of traffic and identify why they are so aggressive (this assumes WFQ is in use). In a properly deployed QoS policy there will of course still be drops during congestion, but WRED-determined, QoS-controlled drops are always better tolerated than undetermined tail drops. The show commands below are the ones to watch.
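Assuming WFQ is re-enabled on the bundle and a service policy is eventually attached, these commands show the per-conversation queue depths and the per-class drop counters (nothing here is specific to this network beyond the interface name):

! per-conversation queue depths under WFQ
show queue multilink 1
!
! per-class offered rate, drops, and WRED counters once a policy-map is attached
show policy-map interface Multilink1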

Cheers,

Brian

Brian,

I completely agree. This situation presented itself to us as a new customer for whom we were already developing a new network design when this problem introduced itself. The current routers were having CPU issues, at least to the point where introducing QoS would have taken them over the edge.

In the new network design we have spread the traffic out, redesigned the IP topology, used better routers, separated the LAN and WAN routing segments, and added QoS and PBR driven by SLA monitoring. So, needless to say, their network will be much better prepared to deal with congestion and bursty/aggressive protocols.

We plan on handing over the sniff sessions to the application developers for the two main proprietary applications that seem to be "wasteful" of network usage.

Thank you for all your help. Your comments are truly appreciated and will be worked through in our lab before we install their new network.

Thanks,

Mike

bbaillie
Level 1

I agree with the previous post as the best option, but I noticed the output queueing strategy is FIFO on the multilink interface. This means a bandwidth-hog application can rapidly fill the queue, causing others to time out and retry because they are not as aggressive as the bad boy. This results in tail drops, which is your symptom. Change the output queueing to weighted fair queueing on the multilink interface; this will allow each session a fair chance at the pipe. From there, run a number of "show queue" commands and note the host(s) that cause the most queued packets; that is the beginning of the list of misbehavers. Now design your QoS policy to limit the bad boys so they can't hog the pipe, but not so much as to cause timeouts on the applications. Something along the lines of the sketch below.
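Purely as an illustration of that last step, here is one way to rate-limit a single offending host once it has been identified. The 10.1.1.50 address, the HOG/LIMIT-HOG names, and the 512 kbps figure are made-up examples, not anything from this thread:

! identify the offender's traffic
access-list 150 permit ip host 10.1.1.50 any
!
class-map match-all HOG
 match access-group 150
!
! police the offender; give everyone else weighted fair queueing
policy-map LIMIT-HOG
 class HOG
  police 512000
 class class-default
  fair-queue
!
interface Multilink1
 service-policy output LIMIT-HOG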

Brian
