When to upgrade a WAN link

Feb 1st, 2008

Our SLA commitments are conditional on the average WAN utilisation remaining below 70%. If the utilisation exceeds 70% we recommend a bandwidth upgrade or a change in the QoS policy.

This figure is now being questioned and I need to identify the rationale for this value. I recall reading somewhere that WAN links should be upgraded if the utilisation exceeds 70% but can't find the reference in any of my Cisco books.

Can someone help me find the reference, or did I just imagine it?

Thanks

Kevin Dorrell Fri, 02/01/2008 - 03:03

I cannot help you with the reference, but I can say that there is much more to a utilisation figure than meets the eye. If you are talking about 70%, you need to ask over what sampling interval.

At any one instant in time, the utilisation of the link is either 0 or 100%. Either a packet is being transmitted at that instant, or it is not. So the utilisation figure is a question of what proportion of the time, measured over a certain interval, the line is in use.

The crux of the matter there is "over a certain interval". If your measurement interval is very short, then you will get a wildly fluctuating graph that will hit 100%, but for very short periods. In the limit, if the measurement interval is infinitesimally small, then the graph whacks between 0 and 100%. If your measurement interval is very long, say 1 hour, or even 1 day, then you will get quite a flat graph, but the peaks will be much lower.
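To make the effect of the interval concrete, here is a minimal Python sketch. The traffic pattern is entirely hypothetical (a line that is busy for 700 time slots and then idle for 300, repeated), and the function name is just for illustration. It averages the same 0/1 "frame on the wire" trace over three different window lengths.

```python
def utilisation(busy, window):
    """Average a 0/1 busy trace over consecutive windows of `window` slots."""
    return [sum(busy[i:i + window]) / window
            for i in range(0, len(busy) - window + 1, window)]

# Hypothetical trace: 1 = a frame is on the wire in this slot, 0 = line idle.
# The line is busy for 700 slots, then idle for 300, repeated ten times.
trace = ([1] * 700 + [0] * 300) * 10

for window in (1, 500, 1000):
    samples = utilisation(trace, window)
    # Shortest window: only 0 or 1 ever appears.
    # Longest window: every sample is the same long-run average.
    print(f"window={window:5d}  min={min(samples):.2f}  max={max(samples):.2f}")
```

The same traffic looks like a link that swings between idle and saturated, or like a steady 70% link, depending only on the window chosen.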

So what measurement interval is fair? Well, that all depends on the user experience. If you are transferring files, then you expect delays, so the measurement interval is logically quite long. If the applications are interactive, then the users expect a faster response, so it is fairer to set a short measurement interval. In real life, the traffic is probably mixed, and most likely has some QoS engineering as well.

So, is 70% useful as a limit on your SLA commitment? It depends on what those commitments are, what the measurement interval is, and what level of risk margin you are prepared to tolerate. The whole thing is a lot more complex than it seems at first sight.

Kevin Dorrell

Luxembourg

aravindhs Fri, 02/01/2008 - 03:15

Kevin,

Are you saying that the line utilization at any instant is either 100% or 0% because the interface drivers are built so that the tx-ring drives the line at full capacity?

But your perspective seems very logical.

Cheers

Arav

Kevin Dorrell Fri, 02/01/2008 - 03:25

Arav,

I wasn't really thinking about the internal architecture of the tx-ring. I was really thinking of an external observer monitoring the electrical signals on the line. At any one instant, there will either be a frame in progress, or there will not be. At that instant, there can never be a partial figure.

We are into the realms of statistics, probabilities, noise, and standard distributions here, the mathematics of which is beyond me.

Kevin Dorrell

Luxembourg

aravindhs Fri, 02/01/2008 - 03:31

Hi Kevin,

I was thinking more along the lines of the clock rate on the line vs CIR vs the frame size and the corresponding serialisation delay. Hence, the percentage utilisation on the line at any instant.

I am terrible with those probability distributions & stats and won't dare to ask you anything more if it is along those lines .. hehe.
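As a rough illustration of how clock rate and frame size relate to utilisation, here is a small Python sketch. The line rate, frame size, and function names are hypothetical figures chosen for the example, not values from this thread. While a frame is being clocked out the line is fully busy for the serialisation delay; any percentage in between only appears once the busy time is averaged over an interval.

```python
def serialisation_delay_ms(frame_bytes, clock_rate_bps):
    """Time the line is busy while one frame is clocked out, in milliseconds."""
    return frame_bytes * 8 / clock_rate_bps * 1000.0

def average_utilisation(frames, frame_bytes, clock_rate_bps, interval_ms):
    """Fraction of the interval the line spends serialising frames."""
    busy_ms = frames * serialisation_delay_ms(frame_bytes, clock_rate_bps)
    return busy_ms / interval_ms

# Hypothetical figures: 1500-byte frames on a 2.048 Mbps line,
# 100 frames sent during a 1-second measurement interval.
delay = serialisation_delay_ms(1500, 2_048_000)
share = average_utilisation(100, 1500, 2_048_000, 1000.0)
print(f"per-frame delay: {delay:.3f} ms, utilisation over 1 s: {share:.1%}")
```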

Cheers

Arav

farouqtaj Fri, 02/01/2008 - 04:04

Our sampling interval is one month. We take measurements every 5 minutes and then average them over the 1-month period.
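For what it is worth, a quick Python sketch shows how much a monthly average of 5-minute samples can hide. The traffic profile is invented for illustration only: 95% utilisation during 8 working hours on 22 working days, and 20% the rest of the time.

```python
SAMPLES_PER_HOUR = 12  # one sample every 5 minutes

def month_of_samples(days=30, working_days=22, busy_hours=8,
                     busy_pct=95.0, quiet_pct=20.0):
    """Build a hypothetical month of 5-minute utilisation samples (percent)."""
    samples = []
    for day in range(days):
        for hour in range(24):
            # Simplification: the first `working_days` days are the working
            # days, and the first `busy_hours` hours of each are the busy ones.
            working = day < working_days and hour < busy_hours
            samples += [busy_pct if working else quiet_pct] * SAMPLES_PER_HOUR
    return samples

samples = month_of_samples()
monthly_average = sum(samples) / len(samples)
over_threshold = sum(1 for s in samples if s > 70.0) / len(samples)
print(f"monthly average: {monthly_average:.1f}%")
print(f"share of 5-minute samples above 70%: {over_threshold:.1%}")
```

With this made-up profile the monthly average stays well under 70% even though the link is close to saturation for every working hour.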

Danilo Dy Fri, 02/01/2008 - 05:03

Hi,

That depends on your total WAN bandwidth, the sum of all your customers' subscribed bandwidth, how each of your customers uses it (more outgoing or more incoming), your formula for oversubscription, and the media/service you use (ATM, MetroE).

If your WAN bandwidth is 100 Mbps, I don't think you should upgrade just because it hits 70 Mbps. However, you should consider the service payload, as the 70% might be referring to ATM with AAL5, where the total payload is around 82%.

Though an internet SLA is not point-to-point, if you use oversubscription, make sure that your customers can burst up to the bandwidth they subscribed to, all the way to your upstream.
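As a rough sketch of where the ATM overhead comes from, the Python below computes the share of the line rate left for the IP packet itself. It assumes plain AAL5 framing only (an 8-byte trailer, padding to a multiple of 48 bytes, 53-byte cells) and ignores any LLC/SNAP or other encapsulation header, so the exact percentage depends on the packet sizes and encapsulation in use. The function names are hypothetical.

```python
import math

CELL_BYTES = 53          # ATM cell: 5-byte header + 48-byte payload
CELL_PAYLOAD_BYTES = 48
AAL5_TRAILER_BYTES = 8   # AAL5 CPCS trailer, appended before padding

def aal5_cells(packet_bytes):
    """Cells needed to carry one packet (assumption: no LLC/SNAP header)."""
    return math.ceil((packet_bytes + AAL5_TRAILER_BYTES) / CELL_PAYLOAD_BYTES)

def aal5_efficiency(packet_bytes):
    """Packet bytes delivered per byte put on the wire."""
    return packet_bytes / (aal5_cells(packet_bytes) * CELL_BYTES)

for size in (40, 64, 576, 1500):
    print(f"{size:5d}-byte packet: {aal5_cells(size):2d} cells, "
          f"{aal5_efficiency(size):.1%} of line rate is payload")
```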

http://www.cisco.com/web/about/ac123/ac147/ac174/ac197/about_cisco_ipj_archive_article09186a00800c8314.html

http://sd.wareonearth.com/~phil/net/overhead/

Regards,

Dandy

Joseph W. Doherty Fri, 02/01/2008 - 19:26

(I wonder whether the rationale might be some shadow of a utilization % figure for shared Ethernet?)

On a WAN link, 70% average utilization only really tells us that, over the measured time period, only 70% of the bits that could have been transmitted were transmitted. We normally don't have additional statistical measurements beyond the average, such as variance or standard deviation. Further, without knowing the needs of the applications generating the traffic, we could not turn these statistics into genuinely useful SLAs even if we had all of them.

Some information to think about:

Suppose we have a fractional T3 with 40 Mbps.

We have 4 hosts on each side, each on a 10 Mbps LAN, that move files for the whole measured period; assume an hour. We would see 100% utilization for the whole hour, yet none of the transfers is delayed! One host alone would still take an hour to transfer its data, although utilization would be only 25%. Three hosts would also take an hour each, with utilization now at 75%.

Take the same 4 hosts, but now with 100 Mbps LANs. One host on its own would take only 15 minutes, and utilization across the hour would be 25%. Three hosts would show 75% utilization, but each host might finish in as little as 15 minutes (if each starts after the prior one finishes) or take 45 minutes (if all start at the same time). Again, utilization is 75% across the hour in either case, but network performance might be perceived as quite different, and variable, by the users.
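The arithmetic behind these two scenarios can be sketched in a few lines of Python. The assumptions are mine: each host moves the amount of data a 10 Mbps LAN carries in an hour, concurrent hosts share the 40 Mbps link equally, and all transfers finish within the hour. The function names are purely illustrative.

```python
LINK_MBPS = 40.0
HOUR_S = 3600.0
DATA_MBIT = 10.0 * HOUR_S  # per host: what a 10 Mbps LAN carries in one hour

def hourly_utilisation(hosts):
    """Average link utilisation over the hour (all transfers finish in time)."""
    return hosts * DATA_MBIT / (LINK_MBPS * HOUR_S)

def concurrent(hosts, lan_mbps):
    """Hosts start together and share the link equally.
    Returns (minutes per transfer, hourly utilisation)."""
    rate = min(lan_mbps, LINK_MBPS / hosts)
    return DATA_MBIT / rate / 60.0, hourly_utilisation(hosts)

def sequential(hosts, lan_mbps):
    """Each host starts only after the previous one has finished.
    Returns (minutes per transfer, hourly utilisation)."""
    rate = min(lan_mbps, LINK_MBPS)
    return DATA_MBIT / rate / 60.0, hourly_utilisation(hosts)

for label, (minutes, util) in [
    ("10 Mbps LAN, 4 hosts together ", concurrent(4, 10.0)),
    ("10 Mbps LAN, 1 host           ", concurrent(1, 10.0)),
    ("100 Mbps LAN, 1 host          ", concurrent(1, 100.0)),
    ("100 Mbps LAN, 3 hosts together", concurrent(3, 100.0)),
    ("100 Mbps LAN, 3 hosts in turn ", sequential(3, 100.0)),
]:
    print(f"{label}: {minutes:5.1f} min per transfer, {util:.0%} utilisation")
```

The last two cases report the same hourly utilisation while the time each user waits differs by a factor of three.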

I suggest the SLA would be better based on packet delay and loss measurements than on a very simplistic link utilization percentage. (Even if the link usage statistics don't contribute much toward a meaningful SLA, they do make for pretty charts.)

You might want to examine the SLA monitoring available within Cisco routers and/or Corvil technology.
