We have a customer who is migrating all of its branch offices (around 250) to a provider cloud with QoS. Most offices have 128 kbit/s of guaranteed bandwidth, burstable to 512 kbit/s.
To check the provider's QoS configuration, we agreed on ports that are used to classify the IP SLA packets into the correct class. The provider assured us that the IP SLA measurement packets are classified into the correct classes. The branch office router is the IP SLA responder.
There are 4 classes agreed upon, effectively 3 because the class for voice traffic isn't in use at the moment. Class2 receives 80% of the bandwidth, class3 gets 80% of the remaining bandwidth, and the rest is for class4 (Best-Effort). In the measurements we sometimes see very high maximum RTT values. The average values are OK.
So far we haven't found any explanation for these high maximum RTT values. Sometimes we see values between 1000 ms and 3000 ms, which means between 1 and 3 seconds! These high values make it hard for us to feel comfortable with the setup.
According to the TAC engineer, there is no known bug in the IOS version we're using, nor do we have a configuration error. We are using the provider's NTP source. Even upgrading the branch office bandwidth to 1 Mbit/1 Mbit (all bandwidth guaranteed) didn't change the situation.
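To make the class split concrete, here is a small Python sketch of the arithmetic. It assumes the percentages apply to the full link rate and that the unused voice class gets nothing; the function name `class_shares` and the use of 1000 kbit/s for the 1 Mbit link are only for illustration, the provider's actual policy may be configured differently.

```python
def class_shares(link_kbit):
    """Split a link rate (kbit/s) with the nested 80% rule described above.

    Assumption: the unused voice class reserves no bandwidth.
    """
    class2 = 0.80 * link_kbit          # 80% of the total
    remaining = link_kbit - class2     # what is left after class2
    class3 = 0.80 * remaining          # 80% of the remainder
    class4 = remaining - class3        # Best-Effort gets the rest
    return {"class2": class2, "class3": class3, "class4": class4}


# Guaranteed rate, burst rate, and the upgraded 1 Mbit link (taken as 1000 kbit/s)
for link in (128, 512, 1000):
    shares = class_shares(link)
    print(link, {name: round(kbit, 2) for name, kbit in shares.items()})
```

Under these assumptions the split works out to 80% / 16% / 4% of the link for class2 / class3 / class4.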
An example measurement:
Round trip time (RTT) Index 55111
Latest RTT: 25579 usec
Latest operation start time: 13:44:42.183 GMT+2 Thu Sep 17 2009
Latest operation return code: Over threshold
Number Of RTT: 60
RTT Min/Avg/Max: 10896/25579/380906 usec
Latency one-way time microseconds
Number of one-way Samples: 0
Source to Destination one way Min/Avg/Max: 0/0/0 usec
Destination to Source one way Min/Avg/Max: 0/0/0 usec
Jitter time microseconds
Number of SD Jitter Samples: 59
Number of DS Jitter Samples: 59
Source to Destination Jitter Min/Avg/Max: 127/1090/20965 usec
Destination to Source Jitter Min/Avg/Max: 56/15026/366617 usec
Packet Loss Values
Loss Source to Destination: 0 Loss Destination to Source: 0
Out Of Sequence: 0 Tail Drop: 0 Packet Late Arrival: 0
Voice Score Values
Calculated Planning Impairment Factor (ICPIF): 0
Mean Opinion Score (MOS): 0
Number of successes: 31
Number of failures: 0
Operation time to live: Forever
This measurement shows a maximum RTT of 381 ms (380906 µs), which is relatively good compared to the high values we sometimes see. It is a measurement in class2 to the branch office with 1 Mbit of guaranteed bandwidth. In the "Packet Loss Values" I have never seen values other than zero, so there is no packet loss, no out-of-sequence packets, and so on. On the recommendation of the TAC engineer we added precision microseconds and clock-tolerance ntp oneway percent 10, which is why you see usec in the output. Since this change the one-way latency values are zero, which according to the configuration guide would indicate that NTP is not synchronized. Nevertheless, the show ntp status command reports the clock as synchronized.
The IP SLA router is the only router I manage and is located at the customer's headquarters. I also run an IP SLA measurement to the LAN interface of the provider's CE router in HQ. No high latency is seen there.
The CE routers in HQ and in the branch office are Cisco devices managed by the provider; I only have SNMP read access to gather some information. The provider's backbone consists of Alcatel devices where no saturation has been found, which I'm willing to believe because I have never seen drops, loss or out-of-sequence packets in the statistics. Apparently we're the first company to do IP SLA measurements in the provider cloud, so the provider doesn't have much experience with other customers doing the same thing.
Has anyone experienced the same thing and found the root cause?
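To compare the figures in the sample output more easily, the Min/Avg/Max lines can be pulled out and converted to milliseconds. The following is a minimal Python sketch, not a general parser: the regular expression and the helper name `parse_min_avg_max` are my own, and it only assumes the "Min/Avg/Max: a/b/c usec" layout shown in the output above.

```python
import re

# Three lines copied from the example measurement above
SAMPLE = """\
RTT Min/Avg/Max: 10896/25579/380906 usec
Source to Destination Jitter Min/Avg/Max: 127/1090/20965 usec
Destination to Source Jitter Min/Avg/Max: 56/15026/366617 usec
"""

# Assumed line layout: "<label> Min/Avg/Max: <min>/<avg>/<max> usec"
PATTERN = re.compile(
    r"^(?P<label>.+?) Min/Avg/Max: (?P<min>\d+)/(?P<avg>\d+)/(?P<max>\d+) usec$"
)


def parse_min_avg_max(text):
    """Return {label: (min_ms, avg_ms, max_ms)} for every matching line."""
    results = {}
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            results[match["label"]] = tuple(
                int(match[key]) / 1000 for key in ("min", "avg", "max")
            )
    return results


stats = parse_min_avg_max(SAMPLE)
for label, (low, avg, high) in stats.items():
    print(f"{label}: min {low:.3f} ms, avg {avg:.3f} ms, max {high:.3f} ms")
```

In this particular sample the maximum destination-to-source jitter (366617 usec) is far larger than the source-to-destination one (20965 usec) and close to the maximum RTT, so the spike appears to sit on the return path for this run.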
Thanks in advance.