05-28-2009 11:51 AM - edited 03-04-2019 04:55 AM
I'm trying to troubleshoot a bandwidth problem. I have several point-to-point T1s in my district that all map back to a single location. The ping times from one particular 2651 router to a host machine at the remote site are very high. There are only about 10 users at the remote site who access an SQL application. It is a 1.5 Mbps point-to-point via ATM to an LS1010. Any ideas on where to begin to pinpoint the latency?
05-28-2009 08:50 PM
What's the delay difference? Do you see any errors on the interfaces?
05-29-2009 05:00 AM
The ping times are very erratic. When issuing a PING command to a host on the other end of the WAN link, it varies from 4/7/8 ms to 4/20/36 ms. I'm just trying to pinpoint the source. What troubleshooting methods can I use to determine errors on either the LAN or WAN interface?
05-29-2009 06:22 AM
Are ping times between WAN interfaces consistent? If yes, the problem is likely in one of the LANs. If no, the problem is likely on the circuit. Apart from erratic ping times, are you experiencing any other issues?
05-29-2009 06:54 AM
Ping times are not consistent. From the remote router (WAN interface IP 192.168.1.22), a ping to the other side (192.168.1.21) reveals 4/6/8 ms, then another ping command comes back 32/36/48 ms. It seems to bounce around quite a bit. I've had the ISP test the circuit, which always comes back clean. I would like to get them onsite to check the TELCO equipment, but I doubt they will come out.
05-29-2009 07:01 AM
What do the CPU, memory, and circuit utilization look like? Ping inconsistency aside, what is the actual problem you are troubleshooting?
05-29-2009 07:05 AM
I have an SQL database that the remote site accesses for student records. There is latency when they open the application at the remote site. I would point the finger at bandwidth as the culprit, but there are only 10-12 users at the remote site. What are the commands to check memory, CPU, and circuit utilization?
05-29-2009 07:12 AM
sh mem stat
sh proc cpu
sh int (Int name and number)
Post the output here if you need assistance.
05-29-2009 07:29 AM
The sh proc cpu and sh mem stat output is too large to post here.
05-29-2009 12:08 PM
"sh proc cpu and sh mem stat too large to post here"
You can also cut/paste it into a .txt file, and add it here as an attachment.
05-29-2009 07:31 AM
Sho int command from other side of WAN link:
OB_BOE_Admin7206VXR_WAN#sho int ATM2/0.60
ATM2/0.60 is up, line protocol is up
Hardware is ENHANCED ATM PA
Description: 1.5MB ATM Conn. to John Glenn School remote CID ASST.100136..NW
Internet address is 192.168.1.21/30
MTU 4470 bytes, BW 1500 Kbit, DLY 190 usec,
reliability 255/255, txload 26/255, rxload 5/255
Encapsulation ATM
183295589 packets input, 4017060384 cells, 188578314481 bytes
154333010 packets output, 1452881578 cells, 67840699668 bytes
0 OAM cells input, 0 OAM cells output
AAL5 CRC errors : 154
AAL5 SAR Timeouts : 0
AAL5 Oversized SDUs : 0
Last clearing of "show interface" counters never
05-29-2009 02:37 PM
Hi, sometimes a few users are more than enough to saturate a circuit. Are you collecting bandwidth usage graphs?
05-29-2009 07:27 AM
sho int command:
OBBOE_JohnGlennSchool_2651#sho int ATM1/0.10
ATM1/0.10 is up, line protocol is up
Hardware is ATM T1
Description: 1.5MB ATM PVC to AdminBldg Remote CID 24.ASST100130..NW
Internet address is 192.168.1.22/30
MTU 4470 bytes, BW 1500 Kbit, DLY 20000 usec,
reliability 255/255, txload 69/255, rxload 69/255
Encapsulation ATM
95735450 packets input, 44942234040 bytes
113557778 packets output,116857343068 bytes
0 OAM cells input, 0 OAM cells output
AAL5 CRC errors : 12
AAL5 SAR Timeouts : 0
AAL5 Oversized SDUs : 0
AAL5 length violation : 0
AAL5 CPI Error : 0
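For reference, the txload/rxload counters in the two outputs above are fractions (x/255) of the configured interface bandwidth. A quick sketch converting them to approximate throughput (values taken from the outputs; this is just arithmetic, not an IOS command):

```python
# Convert IOS txload/rxload counters (reported as x/255 of configured BW)
# into approximate kbps; load values come from the show int outputs above.
def load_to_kbps(load_numerator, bw_kbit, scale=255):
    return load_numerator / scale * bw_kbit

hub_tx = load_to_kbps(26, 1500)     # hub side: txload 26/255
remote_tx = load_to_kbps(69, 1500)  # remote side: txload 69/255
print(round(hub_tx), round(remote_tx))  # -> 153 406
```

So at the moment those commands ran, the hub side was sending roughly 10% of the 1.5 Mbps and the remote side roughly 27% - but these are averages over the load interval, which matters below.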
05-29-2009 10:06 AM
At the time you issued these commands, all looks well as far as load on the circuit is concerned. For troubleshooting purposes I'd recommend changing the load interval to 30 seconds on both sides to get a more accurate picture.
Are only remote site users having trouble with the app?
Did the app ever work / respond efficiently?
Is the problem consistently a problem or every now and again or at certain times during the day?
What other types of traffic traverse this link?
Is there any QoS configured?
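To change the load interval as suggested above, something like this on each router (interface names taken from the earlier outputs; if your IOS version rejects it on the subinterface, apply it on the main interface instead):

```
! Shorten the averaging window for txload/rxload and the bit-rate counters
! from the default 5 minutes to 30 seconds.
interface ATM1/0.10
 load-interval 30
```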
05-30-2009 04:19 AM
I've found variable latency is often, although not always, caused by transient congestion when using FIFO queuing.
Consider that a single large packet of 1500 bytes, or 12,000 bits, will take (12,000 / 1.5 Mbps) = 8 ms to serialize (it would be somewhat more for ATM due to "cell tax"). So it doesn't take many large packets in a FIFO queue to produce variable latency. (I'll come back to this shortly.)
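That serialization math as a quick sketch:

```python
# Serialization delay: time to clock one packet onto the wire = bits / rate.
def serialization_ms(packet_bytes, link_bps):
    return packet_bytes * 8 / link_bps * 1000

print(serialization_ms(1500, 1_500_000))  # 8.0 ms for a 1500-byte packet
```

Four such packets queued ahead of your ping already add ~32 ms, which lines up with the 32/36/48 ms results posted earlier.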
Since ATM is a cloud technology, it's possible there's transient congestion caused by other traffic (not yours) within the ATM cloud. If you're within your contracted PVC bandwidths, you can complain to your provider. However, even if you're not within contract, it's unusual to encounter this with many ATM providers, especially within the US and at T1 PVC bandwidths. More often we need to worry about the T1 bottleneck at our own site, both ingress and egress.
So, back to our sites, for T1 bandwidths, it doesn't take many large packets in a FIFO queue to contribute to variable latency.
Some transient congestion can be hard to "see" with ordinary monitoring which sees an average load over some "large" time period (half minute or minutes) vs. millisecond time periods (as seen by pings).
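A toy illustration of why averaging hides this (the numbers are made up for illustration):

```python
# A link that is 100% busy for 50 ms and idle for the remaining 950 ms
# queues traffic for 50 ms straight, yet a 1-second average reports
# only 5% utilization - the burst is invisible in the graph.
samples = [1.0] * 50 + [0.0] * 950  # per-millisecond utilization samples
avg = sum(samples) / len(samples)
print(f"{avg:.0%}")  # -> 5%
```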
Even with high-precision bandwidth monitoring, under certain conditions of congestion, overall average utilization can still be low. This can happen with TCP if it keeps getting driven into slow start or has retransmission timeouts. (This can be recognized if you have stats on interface drops - including ATM cloud egress.)
Even if the congestion/load isn't high enough to cause drops, again, it's possible there are microbursts that queue. If you can capture stats for current queue lengths (software and hardware [there's also the interface FIFO queue]), you can sometimes "see" this (also again, there's ATM cloud egress to consider).
Given the above background, one method to confirm transient congestion, besides monitoring queue depths (difficult to impossible on the ATM cloud side), would be to implement policers with a minimum Tc, to "clock" bandwidth usage. The policer would pass all traffic, but its stats might show the bursts.
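One possible sketch of such a "metering" policer (MQC syntax; the policy name, rate, and burst value are assumptions, and depending on IOS version you may need to apply the service-policy under the PVC rather than the subinterface):

```
! Both actions transmit, so nothing is dropped; the exceed counters in
! "show policy-map interface" reveal bursts above the T1 rate.
policy-map METER-BURSTS
 class class-default
  police 1536000 2000 conform-action transmit exceed-action transmit
!
interface ATM1/0.10
 service-policy output METER-BURSTS
```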
If you don't want to take the time to confirm whether microbursts are the issue, you can just implement QoS techniques to deal with them generally. For instance, on the branch site set the hardware interface queue to its minimum (tx-ring-limit 2 or 3) and use FQ at the software level. At the hub/HQ, if it has a larger physical bandwidth to the ATM cloud, ensure it doesn't overrun the branch bandwidth, have it also use FQ (per PVC), and minimize its hardware FIFO.
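A minimal branch-side sketch of the above (the VPI/VCI 0/100 and the policy name are hypothetical, and exact syntax varies by IOS version and PA):

```
! Fair queuing in class-default, applied per PVC, plus a small hardware ring
! so large packets wait in the software queue where FQ can interleave them.
policy-map FQ-OUT
 class class-default
  fair-queue
!
interface ATM1/0.10
 pvc 0/100
  tx-ring-limit 3
  service-policy output FQ-OUT
```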
Done correctly, ping latencies (if the above is the issue) should be much more consistent regardless of overall link loading. Users should also see more consistent performance, although they will still see performance decrease if the link is loaded (less bandwidth per user).
PS:
BTW, as to the number of users: just one can often easily max out a T1 that otherwise shows little latency. Something as simple as opening a large file, e.g. a PowerPoint, across a T1 can degrade other users while the data is transferred with FIFO.
Assuming all users are just using SQL, and it's a server variant, this is less likely, although large result sets can cause an issue too.
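To put a number on that (the 10 MB file size is illustrative):

```python
# Rough best-case transfer time over a link, ignoring protocol overhead.
def transfer_seconds(file_mbytes, link_mbps):
    return file_mbytes * 8 / link_mbps

print(round(transfer_seconds(10, 1.5)))  # a 10 MB file ties up a T1 ~53 s
```

During those ~53 seconds, every other user's packets sit behind that transfer in a FIFO queue.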