High maximum RTT with IP SLA measurements

GuyVanDeWiel · ‎09-17-2009

Hi all,

We have a customer who is migrating all of its branch offices (around 250) to a provider cloud with QoS. Most offices have 128 kbit/s of guaranteed bandwidth, burstable to 512 kbit/s.

To check the provider's QoS configuration, we agreed on ports that classify the IP SLA packets into the correct class. The provider has assured us that the IP SLA measurement packets are classified into the correct classes. The branch office router is the IP SLA responder.

There are 4 classes agreed upon, effectively 3 because the class for voice traffic isn't in use at the moment. Class2 receives 80% of the bandwidth, 80% of the remaining bandwidth goes to class3, and the rest is for class4 (Best-Effort). In the measurements we sometimes see very high max. RTT values. Average values are OK.

So far we haven't found any explanation for these high max. RTT values. Sometimes we see values between 1000 ms and 3000 ms, that is, between 1 and 3 seconds! These high values make it hard for us to feel comfortable with the setup.

According to the TAC engineer, there is no bug in the IOS we're using, nor do we have a configuration error. We are using the provider's NTP source. Even upgrading the branch office bandwidth to 1 Mbit/1 Mbit (all bandwidth guaranteed) didn't change the situation.

An example measurement :

Round trip time (RTT) Index 55111

Latest RTT: 25579 usec

Latest operation start time: 13:44:42.183 GMT+2 Thu Sep 17 2009

Latest operation return code: Over threshold

RTT Values

Number Of RTT: 60

RTT Min/Avg/Max: 10896/25579/380906 usec

Latency one-way time microseconds

Number of one-way Samples: 0

Source to Destination one way Min/Avg/Max: 0/0/0 usec

Destination to Source one way Min/Avg/Max: 0/0/0 usec

Jitter time microseconds

Number of SD Jitter Samples: 59

Number of DS Jitter Samples: 59

Source to Destination Jitter Min/Avg/Max: 127/1090/20965 usec

Destination to Source Jitter Min/Avg/Max: 56/15026/366617 usec

Packet Loss Values

Loss Source to Destination: 0 Loss Destination to Source: 0

Out Of Sequence: 0 Tail Drop: 0 Packet Late Arrival: 0

Voice Score Values

Calculated Planning Impairment Factor (ICPIF): 0

Mean Opinion Score (MOS): 0

Number of successes: 31

Number of failures: 0

Operation time to live: Forever

This measurement shows a maximum RTT of 381 ms (380906 µs), which is relatively good compared to the high values we sometimes see. It is a class2 measurement to the branch office with 1 Mbit of guaranteed bandwidth. In the "Packet Loss Values" section I have never seen values other than zero, so there is no packet loss, no out-of-sequence packets, and so on. On the recommendation of the TAC engineer we added precision microseconds and clock-tolerance ntp oneway percent 10, which is why the output is in usec. Since that change the one-way latency values are zero, which according to the configuration guide would indicate that NTP is not synchronized. Nevertheless, the show ntp status command reports the clock as synchronized.
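As a rough aid for reading output like the above, here is a minimal Python sketch that pulls the Min/Avg/Max triples out of the text and converts them to milliseconds. It is not part of the router configuration; the function name and the regular expression are illustrative assumptions, and the sample lines are copied from the measurement shown.

```python
import re

# Sample lines copied from the measurement above.
SAMPLE = """\
RTT Min/Avg/Max: 10896/25579/380906 usec
Source to Destination Jitter Min/Avg/Max: 127/1090/20965 usec
Destination to Source Jitter Min/Avg/Max: 56/15026/366617 usec
"""

# Matches "<label> Min/Avg/Max: a/b/c usec" on a single line.
PATTERN = re.compile(
    r"^[ \t]*(.+?) Min/Avg/Max: (\d+)/(\d+)/(\d+) usec[ \t]*$", re.M
)

def parse_min_avg_max(text):
    """Return {label: (min_ms, avg_ms, max_ms)} for every Min/Avg/Max line."""
    result = {}
    for label, lo, avg, hi in PATTERN.findall(text):
        # Values are reported in microseconds; convert to milliseconds.
        result[label] = tuple(int(v) / 1000.0 for v in (lo, avg, hi))
    return result

if __name__ == "__main__":
    for label, (lo, avg, hi) in parse_min_avg_max(SAMPLE).items():
        print(f"{label}: min {lo:.3f} ms, avg {avg:.3f} ms, max {hi:.3f} ms")
```

In this sample the destination-to-source jitter maximum (about 367 ms) is close to the RTT maximum (about 381 ms), while the source-to-destination maximum is only about 21 ms. That suggests, without proving it, that the spike sits mostly in the branch-to-HQ direction.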

The IP SLA router is the only router I manage and is located in the customer's headquarters. I also run an IP SLA measurement to the LAN interface of the provider's CE router in HQ. No high latency is seen there.

The CE routers in HQ and in the branch office are Cisco devices managed by the provider; I only have SNMP read access to gather some information. The provider's backbone consists of Alcatel devices where no saturation is found, which I'm willing to believe because I have never seen drops, loss or out-of-sequence packets in the statistics. Apparently we're the first company to do IP SLA measurements in the provider cloud, so the provider doesn't have a lot of experience with other customers doing the same thing.

Has anyone experienced the same thing and found the root cause?

Thanks in advance.

Best regards,

Guy

Joseph W. Doherty · ‎09-17-2009

Have the devices you're directing the SLA tests at been enabled as SLA responders? (I assume they have from your description, but depending on the SLA test, that's not always a requirement.)

What kind of WAN cloud is this?

What's the bandwidth at the HQ site?

What's the WAN link technology at the branch sites? Please explain the guaranteed and burstable rates further.

You and the vendor use the SLA ports to select the QoS class for the SLA packets? How is other traffic directed to the correct QoS class? Assuming DSCP markings, why didn't you just use similar markings for your SLA packets?

Is live traffic being passed? If so, how is that traffic directed to the defined QoS classes? BTW, (if my math is correct) 4% (or less? - what's class 1 allocation?) seems rather light for BE.
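The arithmetic behind that 4% figure can be sketched in a few lines of Python. This assumes the split exactly as described in the first post (80% to class2, 80% of the remainder to class3, the rest to Best-Effort) and no allocation for the unused voice class; the function name is illustrative, and which rate the provider applies the percentages to (guaranteed or burst) is not stated in the thread.

```python
def class_shares(class2=0.80, class3_of_rest=0.80):
    """Class2 takes its share first, class3 takes a share of the remainder,
    and Best-Effort gets whatever is left. Returns fractions of the link."""
    c2 = class2
    c3 = (1.0 - c2) * class3_of_rest
    be = 1.0 - c2 - c3
    return {"class2": c2, "class3": c3, "best_effort": be}

if __name__ == "__main__":
    shares = class_shares()
    # Rates mentioned in the thread: 128 kbit/s guaranteed, 512 kbit/s burst.
    for link_kbps in (128, 512):
        parts = ", ".join(
            f"{name} {fraction * link_kbps:.2f} kbit/s"
            for name, fraction in shares.items()
        )
        print(f"{link_kbps} kbit/s link: {parts}")
```

On a 128 kbit/s link this leaves roughly 5 kbit/s for Best-Effort, which is why the BE allocation looks light.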

Besides SLA tests, have you run any continuous ping tests? If so, do you also see occasional very high ping times or is this all unique to SLA tests?

PS:

BTW, the lack of packet loss doesn't always imply there's no congestion. Deep queues can often avoid packet loss but reveal congestion through additional (queuing) delay.

[edit]

PPS:

A quick method I use to confirm vendor QoS is working as expected: I push a UDP data stream toward a branch at 100%, or slightly more, of the branch link bandwidth, with this traffic in the least prioritized class. (This also assumes your sending site has more bandwidth than the branch.)

What you should see is a huge jump in latency for that same QoS class, and likely drops. Higher prioritized classes, if their current traffic is less than their class allocations, should show minimal impact.

If you do the same test at a high class priority level, performance for lower classes might be very much impacted; impact depends on what their current class utilization is.

NB: This test, especially if the vendor QoS is incorrect, can be very hard on concurrent live production traffic.
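A minimal Python sketch of such a UDP stream follows. The function names, the default payload size and the pacing approach are illustrative assumptions, not a specific tool. The rate counts UDP payload only, so IP/UDP headers and any last-mile overhead come on top. The optional tos value only sets the ToS byte on the sender; whether the network honours it depends on how the provider classifies traffic.

```python
import socket
import time

def packet_interval(rate_bps, payload_bytes):
    """Seconds between packets so that payload bits per second equal rate_bps."""
    return payload_bytes * 8 / rate_bps

def send_udp_stream(host, port, rate_bps, duration_s, payload_bytes=1000, tos=0):
    """Send UDP datagrams at roughly rate_bps for duration_s seconds.
    Returns the number of datagrams sent."""
    interval = packet_interval(rate_bps, payload_bytes)
    payload = b"\x00" * payload_bytes
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        if tos:
            # Optional: set the IPv4 ToS byte on outgoing packets.
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
        start = time.monotonic()
        next_send = start
        while time.monotonic() - start < duration_s:
            sock.sendto(payload, (host, port))
            sent += 1
            # Pace against an absolute schedule so small sleep errors do not add up.
            next_send += interval
            delay = next_send - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    return sent
```

For example, `send_udp_stream(branch_host, test_port, 128_000, 60)` would offer about 128 kbit/s of payload for one minute, where `branch_host` and `test_port` are placeholders for a destination and port that land in the class under test.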

GuyVanDeWiel · ‎09-17-2009

Joseph,

Thank you for your response! Because of the number of questions, I've repeated your questions below, marked with *** in front of them.

***The devices you're directing the SLA tests have been enabled SLA responders?

Yes, the branch routers have IP SLA responder configured.

***What kind of WAN cloud is this?

The backbone technology is Ethernet (Alcatel devices).

***What's the bandwidth at the HQ site?

At HQ the link is 30Mbit.

***What's the WAN link technology at the branch sites? Please explain further guaranteed and burstable rates.

The last mile to the branch offices consists of SDSL links with ATM technology. For the guaranteed traffic the packets are marked with the CLP bit (Cell Loss Priority bit), so packets aren't dropped immediately, but their drop chance on the ATM network is higher.

***You and vendor use SLA ports to classify SLA QoS class? How is other traffic directed to correct QoS class? Assuming DSCP markings, why didn't you just use similar markings for your SLA packets?

Provider uses IP Precedence value to classify traffic.

The customer decided which applications belong to which class. On the provider's CE routers, access lists are configured to mark the traffic with the correct IP Precedence value on the LAN interface; classification is done on the WAN link.

The source address of the IP SLA router plus a range of source ports and a range of destination ports are used to classify the IP SLA packets into the correct class. As a test, I did configure the same IP Precedence value for the IP SLA measurement packets, but this marking is not trusted by the CE router and the packets are re-marked. The provider has confirmed that the IP SLA packets are in the correct class, so we didn't go further on this.

***Is live traffic being passed? If so, how is that traffic directed to the defined QoS classes? BTW, (if my math is correct) 4% (or less? - what's class 1 allocation?) seems rather light for BE.

Your math is correct, and yes, we also think the guaranteed bandwidth is too low, but even with 1 Mbit guaranteed we still see high response times in Class2 (80% of the bandwidth).

Class1 is not configured because VOIP is not active.

***Besides SLA tests, have you run any continuous ping test? If so, do you also see occasional very high ping times or is this all unique to SLA tests?

Ping tests are classified in the Best Effort class. We also have an IP SLA measurement in this class, and its max. values are even higher, which is logical given the small amount of available bandwidth.

PS:

***BTW, the lack of packet loss doesn't always imply there's no congestion. Deep queues can often avoid packet loss but indicate congestion by additional (queuing) delay.

I agree, but WRED is also active, so I would expect to see some drops there, yet no drops are seen. I guess no extra-deep queues are configured.

[edit]

PPS:

***A quick method I use to confirm vendor QoS is working as expected, I push a UDP data stream toward a branch at 100%, or slightly more, of branch link bandwidth with this traffic in least prioritized class. (This also assumes your sending site has more bandwidth than the branch.)

***What you should see is a huge jump in same QoS class latency, and likely drops. Higher prioritized classes, if their current traffic is less than their class allocations, should show minimal impact.

Thanks for the suggestion, I'll check if this is a possible test to do.

Again, I agree, but I have even spotted high max. response times when there is no high bandwidth usage... although most of the time the higher values occur when there is higher load. By higher load I mean that there is load but the link is not saturated.

GuyVanDeWiel · ‎09-17-2009

Example of the IP SLA configuration for class 2 with marked IP SLA packet:

(I replaced the IP addresses by *)

ip sla monitor 55118

type jitter dest-ipaddr *.*.*.* dest-port 55118 source-ipaddr *.*.*.* source-port 55118 num-packets 60 interval 1000

precision microseconds

clock-tolerance ntp oneway percent 10

request-data-size 64

tos 128

tag Cat2 Branch officeA

frequency 120

ip sla monitor schedule 55118 life forever start-time now

A similar configuration exists for each class (with other ports). Normally the frequency for class2 and class3 is 900 seconds (15 minutes), but for testing purposes I run the measurement more often for this site (every 120 seconds).
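For reference, the tos 128 value in the configuration above maps onto IP Precedence as follows. This is a small Python sketch of the standard IPv4 ToS byte layout (precedence in the top 3 bits, DSCP in the top 6 bits); the function name is illustrative.

```python
def decode_tos(tos):
    """Split an IPv4 ToS byte into IP Precedence (top 3 bits) and DSCP (top 6 bits)."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS must fit in one byte")
    return {"precedence": tos >> 5, "dscp": tos >> 2}

print(decode_tos(128))  # {'precedence': 4, 'dscp': 32}
```

So the marked test packets leave the IP SLA router with IP Precedence 4, before any re-marking on the CE router.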

Joseph W. Doherty · ‎09-17-2009

BTW, WRED too can support deep queuing; much depends on how it's configured. The fact that its queue depths and drop probabilities are often tied to a moving average of the queue depth also makes it more likely to queue bursts. (Actually, this is a RED feature: pass bursts but drop packets when there's sustained congestion.)
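To illustrate why an averaged queue depth lets bursts through, here is a minimal Python sketch of the exponentially weighted moving average commonly documented for RED/WRED: avg = avg * (1 - 2^-n) + current * 2^-n. The weighting exponent n = 9 and the burst shape are assumptions for illustration; the provider's actual WRED settings are not known from this thread.

```python
def ewma_queue_depth(samples, n=9, start=0.0):
    """Return the running average after each sample, using
    avg = avg * (1 - 2**-n) + cur * 2**-n (RED-style averaging)."""
    weight = 2.0 ** -n
    avg = start
    averages = []
    for cur in samples:
        avg = avg * (1.0 - weight) + cur * weight
        averages.append(avg)
    return averages

if __name__ == "__main__":
    # Illustrative burst: instantaneous depth of 40 packets for 30 updates, then empty.
    burst = [40] * 30 + [0] * 30
    averages = ewma_queue_depth(burst)
    print(f"peak instantaneous depth: {max(burst)} packets")
    print(f"peak average depth: {max(averages):.2f} packets")
```

With these assumed numbers the average stays far below the instantaneous depth, so a drop decision based on the average would not react to the burst, while packets queued behind it still wait.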

From what you're describing, your SLA tests might be "seeing" transient congestion due to queued bursts. Often such transient queuing delays don't require a saturated link, but overall average load is usually higher, which you note when the issue occurs, so this may also be the case here.
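To get a feel for how little queued data such a delay needs, here is a back-of-the-envelope Python sketch. It assumes all of the extra delay is queuing on a single access link in one direction and ignores ATM cell and header overhead; the function names are illustrative, and the RTT figures are the min and max from the sample measurement earlier in the thread.

```python
def queue_delay_ms(backlog_bytes, link_bps):
    """Time in ms to drain backlog_bytes over a link of link_bps."""
    return backlog_bytes * 8 / link_bps * 1000.0

def backlog_bytes_for_delay(delay_ms, link_bps):
    """Bytes that must be queued ahead of a packet to delay it by delay_ms."""
    return delay_ms / 1000.0 * link_bps / 8

if __name__ == "__main__":
    # One 1500-byte packet ahead in the queue on a 128 kbit/s link.
    print(f"{queue_delay_ms(1500, 128_000):.2f} ms per 1500-byte packet at 128 kbit/s")

    # Extra delay in the sample: max RTT minus min RTT, in ms.
    extra_ms = (380906 - 10896) / 1000.0
    backlog = backlog_bytes_for_delay(extra_ms, 1_000_000)
    print(f"{extra_ms:.2f} ms extra at 1 Mbit/s is about {backlog:.0f} bytes queued "
          f"(about {backlog / 1500:.0f} packets of 1500 bytes)")
```

Under these assumptions a burst of a few dozen full-size packets is enough to explain a maximum of around 380 ms on a 1 Mbit/s link without any loss.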

Actual analysis of this issue is going to be difficult because you're working with a managed WAN, and so much is beyond your control.

I would try pings in addition to the SLA tests. Even if they only run in BE, you can compare the ping results with your BE SLA tests and see whether they show similar latencies at the same time. This might help show whether there's really a performance issue across the WAN or whether there's some issue with the SLA feature on the Cisco devices.

You can also pursue the traffic generation tests (even if only in BE), again to determine whether WAN network performance is as expected.

At this point, there's really not enough information to determine why you're seeing the results you do. However, I will note, WAN vendors also make mistakes, which sometimes only come to light when a customer verifies performance.

PS:

An example of a customer finding a WAN vendor network issue: I had one case where I kept after a (tier one) WAN vendor that performance wasn't quite what I believed it should be. It took two months for them to find the cause, which turned out to be buggy firmware for an Ethernet interface board on one of their devices along the path.